1 Overview

During the last 3 years we have developed a mathematical theory of algorithms and implementation
strategies for DSP computations on RISC and DSP chips and on parallel architectures ranging
from scalable multinode boards to massively parallel multinode computers, as typified by
Intel's Touchstone systems.

Recently, our work has centered on implementing the DFT, convolution and wavelet
multirate filter systems on distributed parallel computing platforms, and on embedding the
routines in various applications in collaboration with several government laboratories,
commercial institutions and university research groups.

The  general  goal  of  this  effort  is  to  establish  tools  which  apply  concurrently  to  software  and 
hardware  and  create 

•  a  technology  base  for  developing  optimal  software,  extending  the  life  span  of  software  by 
appropriately  targeting  suitable  hardware. 

• procedures for cost-effective system design of special-purpose architectures which can be
expected to efficiently implement a whole class of similar algorithms of interest.

•  immediate  utilization  of  new  hardware  advances  at  minimal  time  and  cost  in  software 
development. 

The director of the group is Richard Tolimieri, who is partially supported by the contract.
The contract also supports Myoung An full time, Chao Lu of Towson University as a consultant,
and three graduate students, two of whom have received PhDs during this period.

One feature of our approach is that algorithms are modeled in algebraic terms, permitting
software to be optimized by algebraic manipulations as opposed to more time-consuming
programming manipulations. This algebra identifies and operates on fundamental computational
and communication primitives which concurrently model software and machine parameters, and it
establishes interactive programming tools in the form of transformation rules for selecting highly
optimized code for a target architecture.

We have developed a theory of algorithms for DSP computations based on finite abelian
group theory that divorces the problem of algorithm and system design from the particulars of
implementation and application, and that has led to the development of new algorithms which
present radically different communication paths and data structures for subcomputations. This is
especially important in multidimensional processing, which incorporates more degrees of freedom
for system and algorithm design but involves data sizes that challenge hardware memory resources,
I/O and interprocessor communication bandwidth. In this framework, new algorithms have been
designed for incorporating special data characteristics (real, Hermitian, space group symmetric)
and for embedding code in applications highlighting special local data characteristics. Typically
such applications involve iteration of distinct computations where standard algorithms result in
a mismatch between input and output data structures of successive stages.

These tools have had, and will continue to have, a significant impact on computations in such
diverse application areas as image processing, x-ray crystallography, communications,
computational fluid dynamics and computational electromagnetics.

We  have  applied  our  results  summarized  below  in  collaboration  with  government  agencies, 
universities  and  commercial  institutions. 
1.1 Applied Results


Goals and accomplishments:

• Develop a theory for data partition and migration on shared and distributed memory
multiprocessors.

• Formulation of data partitioning and migration schemes in terms of tensor product algebra.

• Implementation of the theory developed for data partitioning and migration in parallel
solutions for applications.

• Implementation of routines to interface various data partitionings in distributed computing
systems for general numerical procedures involving sequences of computations requiring
intermediate data redistribution.

•  Implementation  of  matrix  multiplication  using 
the  theory  to  change  the  data  flow  from  existing 
matrix  multiplication  algorithms. 

• Interface multidimensional FFT for the wavelet-Galerkin and capacitance matrix methods
for the solution of the Euler and Navier-Stokes equations.

• Improve the efficiency of Intel's multidimensional FFT library.

• Interleaved communication and computation in the 3D FFT, along with the use of efficient
vectorized assembly FFT codes, improves the 3D FFT code by up to 50%.

• Tensor product formulation of the 2D FFT allows for maximizing the degree of concurrency
between computations of row 1D FFTs and global transposition, resulting in up to 40% faster
codes.

• Create a scalable 1D and 2D parallel DFT library for power-of-2 and composite transform
sizes, using reduced interprocessor communication variants of the multidimensional (M-D)
Cooley-Tukey and Good-Thomas algorithms.

• A family of M-D implementations improving performance by up to 200% over the power-of-2
Intel 2D and 3D code.

• Create a scalable 1D and 2D composite transform size parallel DFT library on the Intel
iPSC/860 based on the standard row-column method.

• A scalable library of composite-size 1D and 2D parallel DFT implementations with CPU time
compatible with the n log n criterion.

• Create a scalable 2D and 3D library of parallel DFT codes based on the vector-radix algorithm
and compare their performance with the row-column approach.

• A scalable library of 2D and 3D vector-radix implementations, along with a comparison with
row-column implementations and identification of cases where vector-radix outperforms the
row-column method.

• Create a scalable library of efficient non-powers-of-two parallel DFT codes with reduced
inter-processor communication needs, using variants of the RTA algorithm.

• A family of RTA variant implementations improves the performance of the parallel DFT by up
to 75% over the powers-of-two Intel 2D and 3D FFT code.

• Investigate the suitability of the parallel algorithms we proposed for other parallel
multiprocessor systems (clusters of workstations).

•  Parallel  RTA  variants  coded  to  run  on  a  cluster 
of  SUN  workstations  show  promising  speedup 
and  scalability  features. 

• Develop a library of parallel symmetrized DFT codes.

• Derivation of a novel symmetrized DFT algorithm based on group theoretic concepts,
implementable on multi-processor machines, with a wide range of applications in crystallography
and signal processing.

• Investigate integer and rationally oversampled Weyl-Heisenberg coefficient computation in
a distributed memory multiprocessor environment.

• A library of real-time implementations of integer and rationally oversampled Weyl-Heisenberg
coefficient computation on a single i860 processor and on 4- and 8-node computing systems.

• Documentation of the methods employed for the design and implementation of parallel
algorithms in the most widely available form, for the purpose of immediate availability to the
public.

• In addition to publication of Mathematics of Multidimensional Fourier Transform Algorithms,
a Springer-Verlag textbook, several papers have been submitted to journals and presentations
were given at conferences.

• Porting of Touchstone parallel codes to other parallel architectures as a test of portability of
our methods.

• The parallel FFT codes have been successfully ported to the IBM SP2 multiprocessor system of
the NAS NASA Research Center in less than a day.
These applied results have led to the following technology transfers.

1.2  Technology  transfer 

•  David  Grimm,  Honeywell,  Inc.,  813  539  4213 
Embeddable  Multiprocessor  Systems 

We have ported a scalable, multiprocessor, multidimensional FFT routine for variable-size
PARAGON systems. Honeywell has agreed to act as β-site for the codes we have developed and
for the given machine environment.

•  A.  King,  Intel  Corporation,  Supercomputer  Systems  Division,  503  531  5300. 

-  i860 

For the Intel i860, we have developed a library of mixed-size FFT routines, which
will soon be available in the commercial market. The library is three times denser in
transform sizes than existing such libraries. The non-powers-of-two sizes run on the
same linear time scale as the powers-of-two sizes, which run competitively with fully
optimized assembly-coded routines in other libraries.

- Touchstone Systems, DELTA, iPSC/860, PARAGON

For the Intel Touchstone systems, we have implemented scalable, multiprocessor,
multidimensional FFT routines optimized for each of the three systems.

•  E.  Prince,  Reactor  Division,  NIST,  301  970  6230. 

X-ray  crystallographic  FFT  routines. 

SUN, Microway's NumberSmasher860 accelerator card.

We worked with Dr. Edward Prince of NIST to embed our crystallographic group specific
mixed-size FFT library. For computational methods in X-ray crystallography, mixed-size
FFT routines are crucial. The library was created in collaboration with Dr. Prince to address
the most applicable computations for compile-time efficiency. During our collaboration,
Dr. Prince has changed his computing environment three times: VAX, 486 PC and,
most recently, an added i860 accelerator card for compute-intensive procedures. In each of
the computing environments, our codes have significantly improved (3 to 100 times) the
runtime of the computations.

•  J.  Weiss,  Aware,  Inc.,  617  577  1700. 

Computational  Fluid  Dynamics 
2D FFT on Intel's Touchstone systems

We  supplied  Dr.  Weiss  of  Aware,  Inc.  with  data-restructuring  routines  for  Intel’s  Delta 
machine  for  his  parallel  methods  for  incompressible  Euler  and  Navier-Stokes  equations 
for  fluid  dynamics  in  two-space.  While  parallelization  of  other  computational  procedures 
required  non-traditional  data  structures,  parallel  optimized  FFT  routines  are  available  only 
for  row-column  distributed  data  structure.  Our  data  restructuring  routines  are  formulated 
in  terms  of  global/local  stride  permutations,  and  embeddable  in  row-column  distributed 
FFT  routines.  In  fact,  we  have  improved  the  global  FFT  routines  by  120-200%. 

•  C.  Lund,  Mercury  Computer  Systems,  Inc.,  508  256  1300. 

Mercury’s  MCV6 

We are working to port our parallel multidimensional FFT routines and Weyl-Heisenberg
coefficient computation routines to Mercury's four-node board.

• C. Giacovazzo, Dipartimento Geomineralogico, Campus Universitario, Bari, Italia, 39 80
544 2590.

-  SUN 

We  are  developing  optimized  cubic-symmetry-specific  FFT  code  for  Dr.  Giacovazzo 
of  University  of  Bari,  Italy. 

-  i860-based  multiprocessor  boards. 

We are parallelizing Dr. Giacovazzo's software package for small molecule direct
methods, SIR92, for i860-based multiprocessor boards.

•  A.  Woo,  NASA  AMES  Research  Center,  415  604  6010. 

Computational  Electromagnetics. 

Intel’s  PARAGON,  IBM’s  SP2. 

We  have  ported  mixed-size  3-dimensional  parallel  FFT  code  for  Intel  Paragon  and  IBM’s 
multiprocessor  SP2  for  applications  in  computational  electromagnetics. 

• G. Tennille, NASA Langley Research Center, 804 864 5786.

Intel’s  PARAGON,  IBM’s  SP2. 

We have transferred multi-dimensional double-precision FFT routines for the Intel PARAGON
and are in the process of transferring similar codes to IBM's SP2.
• E. Bleszynski, Rockwell International Corporation, North American Aircraft Operations,
310 647 3675.

Adaptive  Integral  Method  Solver  of  Large  Scale  Electromagnetics  Computations. 

Intel’s  Paragon 

We have transferred a package of scalable mixed-radix 3-dimensional FFT routines for
Paragon nodes.

•  E.  Holbert,  Kirtland  AFB,  505  846  1995. 

SUN 

We have transferred FFT routines of sizes 1000 and 1024 for real data sequences, optimized
for the SUN.

•  R.  Pachter,  Wright  Laboratory,  WPAFB,  OH,  513  255  6652. 

Intel’s  Paragon 

Materials  Science 

We  have  ported  real/Hermitian  2-dimensional  parallel  FFT  routines  to  the  material  science 
division  for  Paragon. 

•  R.  Martino,  Department  of  Computer  Science  and  Engineering,  NIH,  301  496  1111. 

Intel's iPSC/860

Molecular Dynamics

We have ported 3-dimensional parallel FFT routines for the iPSC/860 128-node hypercube.
The performance of our code recorded by NIH was 1.2 Gbyte running the FFT, which is the
highest recorded to our knowledge.

•  Steven  Fried,  Microway,  Inc.  508  585  1277. 

During the last year we have actively collaborated with Microway, Inc. to produce a
library for their i860 accelerator card that is three times denser in transform sizes than
existing such libraries. This library was ported to Dr. E. Prince of NIST for interfacing
with his crystallographic procedures and resulted in a speed-up of twenty times. It will
soon be commercially available through Microway, Inc. Presently, we are collaborating on
producing a scalable, mixed-radix, parallel FFT library for the quadputer i860 board. We
have access to Microway's hardware products in these joint efforts.
• M. Tzannes, Aware, Inc., 617 577 1700
Telecommunication project.

- The stage converting fixed-point input to floating point was combined with the inverse
512-point FFT to save arithmetic operations and produce optimal code on the ADSP-21020.

- The Frequency-Equalizer stage was combined with the 512-point FFT to save arithmetic
operations and produce optimized code on the ADSP-21020.

- 512-point DCT II and IV (Discrete Cosine Transform) have been optimized for Analog
Devices' ADSP-21020 chip based on the FFT and will be optimized on the new
ADSP-21060.

- The 512-point DWMT (Discrete Wavelet Multitone Technique) modulator and demodulator
based on DCT IV have been optimized on the ADSP-21020, and will be optimized
on the new ADSP-21060.

•  Loral  Federal  Systems  Inc.  —  Benchmark  on  the  IBM  SP2  project. 

-  QUICK-CPF.F 

-  QUICK-CPF.COMPRESSED.F 

-  IPF.F 

-  QUICKJPF.F 

The above 4 routines were optimized on IBM SP2 parallel systems, with about 30%
improvement.

•  Atlantic  Aerospace  —  FIR  filter  on  ISP  multi-processor  board  based  on  TI  TMS320C40 
chip. 

-  4-point  FIR  filter 

-  8-point  FIR  filter 
2 Methodologies

2.1  Data  Partition  and  Migration  on  Distributed  Memory  Multiprocessors 

Problem: The efficiency of a parallel implementation depends on the implementation of the data
movements that describe the required communication, since the overhead in distributed
computing lies in the required communication between processors. Although there are algorithms
which address the complexity of data flow in addition to arithmetic complexity, a unified
methodology for analyzing and designing the data movements is lacking.

Approach: To present a formal methodology for the process of data distribution and redistribution
using tensor products and stride permutations as tools. The algebraic expressions
representing data partition and migration operate directly on data vectors, and hence can be
immediately embedded into an algorithm.
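
As an illustration of how a stride permutation acts directly on a data vector, the following is a minimal Python sketch, not the implementation used on the Touchstone machines. The function names (`stride_permutation`, `block_partition`, `cyclic_partition`) are hypothetical, and processors are simulated as plain lists. It assumes the usual convention that L(N, s) gathers the elements of a length-N vector at stride s, so that a cyclic distribution is a block distribution of the permuted vector.

```python
def stride_permutation(x, s):
    """Apply L(N, s): read x at stride s, starting at offsets 0, 1, ..., s-1."""
    n = len(x)
    if n % s != 0:
        raise ValueError("the stride must divide the vector length")
    return [x[j] for i in range(s) for j in range(i, n, s)]


def block_partition(x, p):
    """Split x into p contiguous blocks, one per (simulated) processor."""
    n = len(x)
    if n % p != 0:
        raise ValueError("the processor count must divide the vector length")
    size = n // p
    return [x[q * size:(q + 1) * size] for q in range(p)]


def cyclic_partition(x, p):
    """Cyclic distribution expressed as a stride permutation followed by blocking."""
    return block_partition(stride_permutation(x, p), p)


x = list(range(12))
blocks = block_partition(x, 3)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
cyclic = cyclic_partition(x, 3)  # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
# L(12, 4) undoes L(12, 3), so a redistribution can be reversed algebraically.
restored = stride_permutation(stride_permutation(x, 3), 4)
```

Because the redistribution is an explicit permutation of the data vector, moving from one partition to another reduces to composing permutations, which is the property the approach relies on.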

Goals:  To  implement  and  embed  data  partition  and  migration  algorithms. 

Applications: General numerical solutions that require successive stages of computation and data
redistribution.

Results: A unique data distribution technique that effectively uses transpose algorithms for
multiplication of two rectangular matrices has been derived and implemented. The performance of
these algorithms was evaluated by carrying out implementations on Intel's i860-based iPSC/860,
Touchstone Delta, Gamma, and Paragon supercomputers.

We implemented the data redistribution algorithm for the two-dimensional Euler partial
differential equation (PDE) solved by the wavelet-Galerkin method, where the two most important
computation modules in this solution require two different data partitions for their optimal
implementation. Results on the effect of this implementation on overall performance are included.
2.2 Efficient Multidimensional DFT Module Implementation on the INTEL
i860 Processor

Problems: A standard method [2] of implementing a DFT whose transform size is not a power of 2
is zero-padding.

In multi-dimensional DFT computation, this increases the transform size dramatically,
not only slowing down the computation but also causing cache thrashing and memory
overflow. In the case of the parallel computer iPSC/860, each node processor has 8 Mbytes of
memory. If the size of the complex data to be processed is 72 x 72 x 72 = 373,248, the
computation is carried out in the local memory of the processing unit without data segmentation.
On the other hand, after padding with zeros, the size of the complex data to be processed is
128 x 128 x 128 = 2,097,152, which is beyond the capacity of local memory; segmentation
and moving data in and out of memory will cause severe problems.
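
The memory argument can be checked with a short calculation. This is a sketch under one assumption that the text does not state: each complex value is taken to occupy 8 bytes (single-precision real and imaginary parts). The names `complex_array_bytes`, `fits_in_node` and `NODE_MEMORY_BYTES` are illustrative only.

```python
NODE_MEMORY_BYTES = 8 * 2**20  # 8 Mbytes per iPSC/860 node, as quoted above


def complex_array_bytes(shape, bytes_per_complex=8):
    """Return (element count, storage in bytes) for a complex array of this shape.

    bytes_per_complex=8 assumes single precision; this is an assumption,
    not a figure taken from the report.
    """
    count = 1
    for extent in shape:
        count *= extent
    return count, count * bytes_per_complex


def fits_in_node(shape, bytes_per_complex=8):
    """True if the array alone fits in one node's memory."""
    return complex_array_bytes(shape, bytes_per_complex)[1] <= NODE_MEMORY_BYTES


native = complex_array_bytes((72, 72, 72))     # 373,248 points
padded = complex_array_bytes((128, 128, 128))  # 2,097,152 points
```

Under this assumption the 72 x 72 x 72 array needs about 3 Mbytes and fits in one node, while the zero-padded array needs 16 Mbytes, twice the node memory, before any workspace or program code is counted.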

Approach:  By  formulating  various  DFT  algorithms  in  the  language  of  tensor  products,  any  large  size 
Fourier  transform  is  built  up  by  a  collection  of  small  size  DFT  modules  which  include  as 
parameters  decimation  step  sizes  and  twiddle  factors.  These  parameters  are  introduced 
in  the  DFT  modules  to  take  advantage  of  modern  computer  architectures  with  parallel, 
pipelined,  multi-functional  structures,  while  providing  flexibility  into  the  building  blocks. 

Our  library  of  core  computation  modules  has  the  following  features: 

- We have efficiently implemented prime factors 3, 5, 7, 11, 13, 17 as well as powers of
2. Thus, the transform size in each dimension of a multi-dimensional Fourier transform
can have factors other than 2.

- One-dimensional small modules take advantage of vector operations on the i860 by looping
on other factors of the same dimension and on other dimensions.

- One-dimensional small modules have a pre-calculated twiddle factor array as a parameter.
This provides for intermediate stages of Cooley-Tukey FFT implementation.
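
The way small DFT modules with a twiddle factor parameter combine into a mixed-size transform can be sketched as follows. This is an illustrative pure-Python version, not the vectorized i860 code: `small_dft`, `smallest_factor` and `mixed_radix_fft` are hypothetical names, the small modules are computed directly from the definition rather than by optimized prime-size kernels, and the decimation step is expressed by the slice `x[j::r]`.

```python
import cmath


def small_dft(values, twiddles=None):
    """Direct r-point DFT module; inputs are pre-multiplied by twiddles if given."""
    r = len(values)
    if twiddles is not None:
        values = [v * t for v, t in zip(values, twiddles)]
    return [sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / r) for j in range(r))
            for k in range(r)]


def smallest_factor(n):
    """Smallest factor of n greater than 1 (n itself when n is prime or n <= 1)."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n


def mixed_radix_fft(x):
    """Cooley-Tukey decimation in time for any length, built from small modules."""
    n = len(x)
    r = smallest_factor(n)
    if r == n:                      # prime (or trivial) size: one small module
        return small_dft(x)
    s = n // r
    # r sub-transforms of length s on the inputs decimated by step r
    subs = [mixed_radix_fft(x[j::r]) for j in range(r)]
    out = [0j] * n
    for k1 in range(s):
        # twiddle factor array for this intermediate stage
        tw = [cmath.exp(-2j * cmath.pi * j * k1 / n) for j in range(r)]
        column = small_dft([subs[j][k1] for j in range(r)], tw)
        for k2 in range(r):
            out[k1 + s * k2] = column[k2]
    return out
```

For example, a length-72 input is split as 2 x 2 x 2 x 3 x 3, so only 2-point and 3-point modules are ever evaluated, and no zero-padding to 128 is needed.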

Goals: To create a scalable DFT library on the Intel i860 with mixed-radix transform sizes, with
CPU time comparable to that of the closest power-of-2 transform size.

Results: Timing results for some sample medium-size 2-dimensional DFT modules with prime
factors in each dimension are provided for the Intel i860 processor. Results for a comparable
commercially available power-of-2 FFT package [6] are also included.
2.3 Efficient parallel implementation of traditional FFT codes

Problem: Data partition and migration for efficient communication in distributed memory
architectures are critical for the performance of data parallel programs.

Approach: Data partition and migration for efficient communication in distributed memory
architectures are critical for the performance of data parallel programs. This research presents a
formal methodology for the process of data distribution and redistribution using tensor
products and stride permutations as mathematical tools. The algebraic expressions
representing data partition and migration operate directly on a data vector, and hence can
be conveniently embedded into an algorithm. It is also shown that these expressions are
useful for clearly understanding, and efficiently interleaving, problems that involve different
data distributions at different phases. This compatibility allowed us to use these
expressions successfully in developing and demonstrating matrix transpose and fast Fourier
transform algorithms. An effort to minimize communication cost using the expressions for data
distribution revealed a routing scheme for Fourier transform evaluation. The results suggest that,
for large parallel machines, this scheme is a solution to today's problems, which feature
enormous data sets. Finally, a unique data distribution technique that effectively uses transpose
algorithms for multiplication of two rectangular matrices is derived. The performance
of these algorithms is evaluated by carrying out implementations on Intel's i860-based
iPSC/860, Touchstone Delta, Gamma, and Paragon supercomputers.

The global transposition stage, which interchanges the last two dimensions of the data matrix
distributed among the processors, is interleaved with 1D FFT computations along the
dimension that is orthogonal to the other two, to hide the communication cost and achieve
much better processor utilization.

In the 2D row-column FFT case, the global transposition can be decomposed into a number
of smaller global transpositions of partial data that can be performed concurrently with
the first stage of FFT computations (1D FFTs along the rows). In a similar fashion, the
second global transposition step, which is required if the results are to be returned in their
original order, can be interleaved with the second FFT computational stage to totally hide
the communication costs within the computations. For a more detailed description of the
approach we followed, please see Appendix B.
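
The decomposition of the global transposition into partial transpositions can be sketched sequentially. The code below is a single-process simulation under stated assumptions: `dft2_blocked` and its `block_rows` parameter are hypothetical names, the 1D transforms are evaluated directly from the definition, and the scatter of each finished block stands in for a message that, on a real machine, would be in flight while the next block of rows is being transformed. `dft2_direct` is included only as a reference for checking.

```python
import cmath


def dft(x):
    """Direct 1D DFT, used here in place of an optimized FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def dft2_direct(a):
    """Reference 2D DFT straight from the definition."""
    n1, n2 = len(a), len(a[0])
    return [[sum(a[i][j] * cmath.exp(-2j * cmath.pi * (i * k1 / n1 + j * k2 / n2))
                 for i in range(n1) for j in range(n2))
             for k2 in range(n2)] for k1 in range(n1)]


def dft2_blocked(a, block_rows):
    """Row-column 2D DFT with the transposition split into per-block pieces."""
    n1, n2 = len(a), len(a[0])
    t = [[0j] * n1 for _ in range(n2)]          # transposed intermediate
    for start in range(0, n1, block_rows):
        block = [dft(row) for row in a[start:start + block_rows]]
        # Partial transposition of the rows just computed. In a distributed
        # code this exchange would overlap with the next block's row FFTs.
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                t[j][start + i] = value
    stage2 = [dft(row) for row in t]            # second stage of 1D transforms
    # Second transposition returns the result in its original order.
    return [list(col) for col in zip(*stage2)]
```

Choosing `block_rows` trades message count against overlap: smaller blocks start communication earlier but send more, smaller messages.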

Goals: Create an efficient global transposition algorithm that interleaves computations with
communications. Take full advantage of the iPSC/860 hardware, which allows computations
to be performed at the same time as communications, so that the data exchange
stage starts at the same time as one of the computational stages of the 3D and 2D FFT.
Produce fully scalable, efficient codes for large multi-dimensional FFTs and evaluate
their performance.

Applications: The power-of-two FFT has become a standard in many applications. The 3D FFT of
large data size is a major component in a huge variety of signal processing applications in
seismology, oil exploration, crystallography, meteorology, motion detection, etc. The large-size
2D FFT has many applications ranging from image processing to system identification
and signal reconstruction.
2.4 Vector-radix on the Paragon

The Vector-Radix (VR) algorithm is a vector generalization of the Cooley-Tukey algorithm for
the case of two-dimensional and, in general, multi-dimensional FFT computations. In a
uni-processor environment it has been shown that the VR can result in more efficient
implementations than the straightforward application of the Row-Column (RC) method, which
computes a multi-dimensional FFT by sequentially applying 1D FFTs along each of the dimensions.
This is due mainly to the fact that the VR accesses a particular data point stored in local memory
less frequently than the RC does. In shared-memory multiprocessor systems, where the cost of data
access is non-uniform, depending on where the data is stored, it is not clear that VR-type
algorithms will be more efficient than the RC method. The main advantage of the VR
formulation in the case of parallel multi-dimensional FFTs, however, is the increased flexibility in
initial and final data distribution and in data/computation flow, which allows for the design of
codes that match the target multiprocessor machine parameters in an optimal way.

In  the  2D  case,  VR  formulations  usually  require  three  instead  of  two  global  communication 
stages  which  makes  them  unattractive  for  implementation  on  machines  with  high  inter-processor 
communication  costs.  On  the  other  hand,  because  the  local  memory  accesses  are  much  more 
regular  than  in  the  case  of  the  RC  implementations,  for  machines  with  fast  inter-processor 
communication  links,  the  VR  results  in  more  efficient  implementations. 
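
To make the contrast with the row-column method concrete, the following is a minimal sketch of a radix-2 x 2 vector-radix recursion in Python. It assumes a square N x N input with N a power of two, uses hypothetical names (`vr_fft2`, `dft2_direct`), and says nothing about data distribution or about the Paragon codes themselves. Each butterfly combines four quarter-size 2D transforms at once instead of processing one dimension at a time.

```python
import cmath


def dft2_direct(a):
    """Reference 2D DFT straight from the definition (for checking only)."""
    n1, n2 = len(a), len(a[0])
    return [[sum(a[i][j] * cmath.exp(-2j * cmath.pi * (i * k1 / n1 + j * k2 / n2))
                 for i in range(n1) for j in range(n2))
             for k2 in range(n2)] for k1 in range(n1)]


def vr_fft2(x):
    """Vector-radix (2 x 2) FFT of a square array whose side is a power of two."""
    n = len(x)
    if n == 1:
        return [[x[0][0]]]
    h = n // 2
    # Four sub-transforms on the even/odd decimations in both dimensions.
    sub = {(a, b): vr_fft2([row[b::2] for row in x[a::2]])
           for a in (0, 1) for b in (0, 1)}
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    out = [[0j] * n for _ in range(n)]
    for k1 in range(h):
        for k2 in range(h):
            s00 = sub[0, 0][k1][k2]
            s01 = sub[0, 1][k1][k2] * w[k2]
            s10 = sub[1, 0][k1][k2] * w[k1]
            s11 = sub[1, 1][k1][k2] * w[k1 + k2]
            # 2 x 2 vector butterfly: one pass produces four output points.
            out[k1][k2] = s00 + s01 + s10 + s11
            out[k1][k2 + h] = s00 - s01 + s10 - s11
            out[k1 + h][k2] = s00 + s01 - s10 - s11
            out[k1 + h][k2 + h] = s00 - s01 - s10 + s11
    return out
```

In this form each data point is touched once per stage for both dimensions together, which is the reduced access frequency the paragraph above attributes to the VR formulation.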
2.5 Reduced transform algorithms

Problems: Most variants of the Cooley-Tukey FFT algorithm deal with FFT computations as
multiple-stage calculations with data permutation between stages. This requires extensive
interprocessor communication for implementing large transpositions.


Approach: We present a strategy for computing a multidimensional DFT that combines a relatively
new algorithm (the Reduced Transform Algorithm) with already implemented single-processor
kernel routines. We will use the reduced transform algorithm to address the reduction and
optimization of interprocessor communications. Our work has been mainly motivated by
the distributed memory parallel computing paradigm, which is arguably the most difficult
to harness because its interprocessor communication is exposed to the programmer. Most
parallel computers require sophisticated algorithms and programming techniques for their
optimum utilization. In this discussion, we will make use of algebraic facts in presenting the
algorithms. The parameters in algebraic formulas give us the important implementation
parameters. Thus the flexibility to address the variables in implementations is equated
with flexibility in manipulating the algebraic formalism. An initial investment in familiarity with
some amount of algebra may be necessary, but the payoff is immediate. Most of the
relevant algebra, not in its most rigorous form but in its usage, can be found in [8].

In its most general form, the Reduced Transform Algorithm (RTA) is a full utilization of the
duality between periodic and decimated data in the Fourier transform. This duality was
used partially in some algorithms and implementations for restricted cases [4, 2, 5, 10]. A
description of a generalization in a unified setting is found in [1, 9] along with the work
of M. Rofheart [7]. In this paper, we will consider the application of the RTA to the case
Z/P x Z/P, for a prime number P. The tensor product formulation of the DFT computation on
Z/N x Z/P x Z/P is interleaved with the periodization step in the RTA for Z/P x Z/P to
produce P + 1 independent data sets of size NP.

We use the RTA to address the imbalance between computation and communication rates
in current distributed memory parallel machines by reducing communication between
processors to collective patterns only (broadcast and combine) instead of the all-to-all
communication patterns required in the global matrix transpose needed by the row-column
(RC) implementations of multidimensional DFTs. Also, since fast algorithms for prime-size
1D DFTs exist [8] and the case Z/P x Z/P of the RTA is very efficient because its
computation requires only P + 1 1D transforms (versus 2P for the row-column method),
our approach addresses the issue of storage reduction by providing additional transform
size options. For example, the ability to perform a 181 x 181 point 2D DFT means potential
storage savings of up to 50% over the 256 x 256 case, along with savings in computational
time. The storage savings can be used for the optimization of the broadcasting step needed
for the RTA, in environments with long communication latency.
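
The Z/P x Z/P case can be sketched in a few lines. The following Python is an illustrative single-process version under the assumption that P is prime; `rta_dft2_prime` is a hypothetical name, the 1D transforms are evaluated directly rather than by a fast prime-size algorithm, and `dft2_direct` is included only as a reference. For each of the P + 1 lines through the origin, the data are periodized (summed over the cosets on which the inner product with the line direction is constant) and a single P-point DFT then gives the 2D DFT values along that line.

```python
import cmath


def dft(x):
    """Direct 1D DFT, standing in for a fast prime-size kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def dft2_direct(a):
    """Reference 2D DFT straight from the definition (for checking only)."""
    n1, n2 = len(a), len(a[0])
    return [[sum(a[i][j] * cmath.exp(-2j * cmath.pi * (i * k1 / n1 + j * k2 / n2))
                 for i in range(n1) for j in range(n2))
             for k2 in range(n2)] for k1 in range(n1)]


def rta_dft2_prime(f):
    """2D DFT on Z/P x Z/P (P assumed prime) from P + 1 one-dimensional DFTs."""
    p = len(f)
    # One direction per line through the origin: (1, m) for each m, and (0, 1).
    directions = [(1, m) for m in range(p)] + [(0, 1)]
    out = [[0j] * p for _ in range(p)]
    for d1, d2 in directions:
        # Periodization: additions only, no multiplications.
        g = [0j] * p
        for n1 in range(p):
            for n2 in range(p):
                g[(n1 * d1 + n2 * d2) % p] += f[n1][n2]
        line = dft(g)                      # one P-point transform per line
        for t in range(p):
            out[(t * d1) % p][(t * d2) % p] = line[t]
    return out
```

Each pass of the outer loop is independent of the others, so in a multiprocessor setting the directions could be assigned to different processors once the input has been broadcast.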

Via the Chinese remainder theorem, we will extend our method to compute the 3-dimensional
DFT on Z/N x Z/MP x Z/KP, where N is an arbitrary integer, M and K are integers
not divisible by P, for a prime P. We transform the data set to an equivalent 5D data set
on Z/N x Z/M x Z/K x Z/P x Z/P, and then employ the RTA on the last two indices
to break the problem into smaller independent sub-problems that can be computed in
parallel. Each sub-problem is associated with the computation of the value of the Fourier
transform along one line in the set Z/P x Z/P passing through the origin. These lines
intersect only at the origin and cover the index space. When translated from the 5D data
set back to the original 3D data, each line corresponds to a set of parallel lines covering
the index space.
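
The Chinese remainder theorem step can be illustrated on a single axis. The sketch below shows how a 1D DFT on Z/MP, with M not divisible by the prime P, becomes a 2D DFT on Z/M x Z/P with no twiddle factors between the two dimensions. It is an illustration, not the report's code: `crt_dft` is a hypothetical name, the transforms are direct, and the modular inverses are obtained with Python's three-argument `pow` (negative exponents require Python 3.8 or later).

```python
import cmath
import math


def dft(x):
    """Direct 1D DFT, used for both the small transforms and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def crt_dft(x, m, p):
    """DFT of length m*p via the CRT index map onto an m x p array (gcd(m, p) = 1)."""
    if math.gcd(m, p) != 1 or len(x) != m * p:
        raise ValueError("need len(x) == m * p with m and p coprime")
    n = m * p
    # Input map: index i goes to (i mod m, i mod p).
    a = [[0j] * p for _ in range(m)]
    for i in range(n):
        a[i % m][i % p] = x[i]
    rows = [dft(row) for row in a]                  # transforms along Z/P
    cols = [dft(list(col)) for col in zip(*rows)]   # transforms along Z/M
    u = pow(p, -1, m)   # inverse of p modulo m
    v = pow(m, -1, p)   # inverse of m modulo p
    # Output map: X[k] is the 2D result at ((u*k) mod m, (v*k) mod p).
    return [cols[(v * k) % p][(u * k) % m] for k in range(n)]
```

Applying the same map to the second and third axes of data on Z/N x Z/MP x Z/KP gives the 5D index set described above, whose last two indices both range over Z/P.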

Three stages are needed to compute the values of the DFT along the lines: (1) a periodization
stage, which consists of additions of data along lines perpendicular to a given line, (2) a
3D Cooley-Tukey FFT, and (3) a P-point DFT. In a multiprocessor environment, each processor
computes these three steps independently of the others, thus allowing for maximum
parallelism and efficiency. Moreover, the final data distribution among the processors
permits further processing in a parallel fashion, since every processor holds only
results belonging to the same geometrical subset.
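
The periodization and P-point DFT stages can be illustrated on a plain P x P array. The sketch below is a simplified 2D case under stated assumptions: it omits the 3D Cooley-Tukey stage, uses a naive DFT as a stand-in for a fast routine, and all function names are hypothetical.

```python
import cmath

def dft(v):
    """Naive 1D DFT (stand-in for any fast P-point routine)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def dft2_point(f, k1, k2):
    """Direct evaluation of one 2D DFT output, used only for checking."""
    P = len(f)
    return sum(f[x1][x2] * cmath.exp(-2j * cmath.pi * (k1 * x1 + k2 * x2) / P)
               for x1 in range(P) for x2 in range(P))

def dft_along_line(f, a, b):
    """DFT values at (k*a, k*b) mod P for k = 0..P-1."""
    P = len(f)
    g = [0j] * P
    # Periodization: add up the data on each line a*x1 + b*x2 = t (mod P).
    for x1 in range(P):
        for x2 in range(P):
            g[(a * x1 + b * x2) % P] += f[x1][x2]
    # A single P-point DFT of the periodized data gives the outputs on the line.
    return dft(g)

P = 5
f = [[complex((3 * i + 7 * j) % 11, (i * j) % 4) for j in range(P)] for i in range(P)]
line = dft_along_line(f, 1, 2)
max_err = max(abs(line[k] - dft2_point(f, k % P, (2 * k) % P)) for k in range(P))
```

Because each call to `dft_along_line` reads the input but shares no intermediate results with other directions, different directions can be processed without communication once the data has been broadcast.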

The proposed hybrid method (HRTA) can be used in applications such as the computation
of motion from a sequence of images (multi-frame detection, MFD), a very important task
in computer vision, HDTV and video telephony. Several methods for MFD have been
proposed in the literature. They are usually divided into two categories: Time Domain
methods, which estimate the motion by processing the sequence of images directly, and the
recently proposed Frequency Domain methods [3], [6], which process the frequency contents
of the images to estimate the velocity and trajectory of the moving components. The latter
methods offer more robust detection and huge computational savings, since the frequency
domain representation of the 3D data (sequence of 2D images) is more compact than the
equivalent time domain representation. With all the processors holding data belonging
to different lines in the frequency domain, each processor can independently test for the
presence of motion along its assigned direction.
2.6 Non-power-of-two scalable DFT library based on RTA variants

Problem: In many applications the data size is not a power of two, so zero padding has to be
employed to use the efficient FFT algorithms. In the multi-dimensional case, zero padding
tremendously increases the data size and hence the required computational time.


Approach: The recently proposed RTA is combined with the Good-Thomas factorization and the
standard Cooley-Tukey FFT algorithm to give DFT algorithms that require a reduced
amount of inter-processor communication at the expense of larger data storage needs and
additional pre-processing stages. The Hybrid RTA variants, as well as the implementation
issues, are described in detail in Appendix A.
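
As a minimal sketch of the Good-Thomas idea mentioned above, the following Python code computes a DFT of length N1*N2 (with N1 and N2 coprime) as an N1 x N2 two-dimensional DFT with no twiddle factors. The index maps are the standard ones; the function names and the naive DFT helper are assumptions made for illustration and are not the report's library code.

```python
import cmath

def dft(v):
    """Naive 1D DFT (stand-in for a fast small-size module)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def good_thomas_dft(x, N1, N2):
    """Length N1*N2 DFT for coprime N1, N2 via a twiddle-free 2D DFT."""
    N = N1 * N2
    # Input index map: n = (N2*n1 + N1*n2) mod N.
    a = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]
    rows = [dft(r) for r in a]                                   # rows[n1][k2]
    cols = [dft([rows[n1][k2] for n1 in range(N1)]) for k2 in range(N2)]  # cols[k2][k1]
    # Output index map (Chinese remainder theorem): k1 = k mod N1, k2 = k mod N2.
    return [cols[k % N2][k % N1] for k in range(N)]

x = [complex((n * n) % 7, n % 3) for n in range(15)]
reference = dft(x)
result = good_thomas_dft(x, 3, 5)
```

A length-15 transform is handled here with 3-point and 5-point DFTs only, which is the kind of size flexibility that avoids zero padding to 16.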

Goals: To create a fully scalable non-power-of-two DFT library for the 2D and 3D cases employing
the concepts of the RTA; to investigate in detail the performance and the tradeoffs of the
new algorithms; and to propose efficient hardware structures that would further improve the
DFT codes.

Applications: The special structure of the RTA, which computes the output of the DFT along particular
geometrical subsets of the original index set, can be used for fast moving-target tracking
and recognition, as well as for digital video compression. The RTA variants are especially
suitable for implementation on DSP multi-processor boards and clusters of workstations.


2.7 Symmetrized crystallographic parallel DFT algorithms

Problem: In many applications (crystallography, higher order spectra computations) the data have
inherent redundancy due to symmetries in their structure. In most cases these symmetries
can be expressed as group actions (affine or point groups). If efficient algorithms for the
computation of the DFT of such data are desired, the inherent data symmetries need to be
taken into account to achieve both data reduction and computational savings. Although
considerable work has been done on the computation of symmetrized DFTs, algorithms
that can be implemented on a parallel machine still need to be derived.

Approach: A group theoretic approach is taken to decompose the data set into orbits that are
characterized by constant data value, and to perform a data reduction step by choosing only one
representative data point for each such orbit. To take advantage of fast DFT routines, the
representatives of the orbits are covered with the minimum number of lines through
the data space, and then RTA variants are employed to compute the value of the
DFT along these points efficiently. The algorithm is being generalized to a large collection
of data sizes by employing the Chinese Remainder Theorem and the Good-Thomas
permutation.
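
The orbit decomposition and choice of representatives can be sketched as follows. The 4-element rotation group used here is a hypothetical example of a point group acting on Z/N x Z/N, and the function name is an assumption; the report's own symmetry groups are not reproduced.

```python
def orbits(N, group):
    """Partition Z/N x Z/N into orbits of a finite group of 2x2 integer matrices.

    `group` must list every element of the group as a tuple (a, b, c, d)
    standing for the matrix [[a, b], [c, d]].
    """
    seen, reps = set(), {}
    for x in range(N):
        for y in range(N):
            if (x, y) in seen:
                continue
            orbit = {((a * x + b * y) % N, (c * x + d * y) % N)
                     for (a, b, c, d) in group}
            seen |= orbit
            reps[(x, y)] = orbit      # (x, y) is the chosen representative
    return reps

# Hypothetical example: rotations by multiples of 90 degrees.
C4 = [(1, 0, 0, 1), (0, -1, 1, 0), (-1, 0, 0, -1), (0, 1, -1, 0)]
reps = orbits(8, C4)
n_points = sum(len(o) for o in reps.values())
```

For this example the 64 index points reduce to 18 representatives, which is the data reduction step described above; symmetric data needs to be stored only at the keys of `reps`.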

Goals: Theoretical study of symmetrized DFT algorithms suitable for implementation on parallel
multi-processor machines. Development of a unified theory to treat all symmetries usually
encountered in practical applications. Development of a general symmetrized DFT library
for the Intel iPSC/860 and Paragon multiprocessors.

Applications: Determination of the structure of a crystal from X-ray diffraction data, efficient
computation of higher order statistics for signal analysis and reconstruction for application to
material science and protein crystallography.
2.8 Implementation of integer and rationally oversampled Weyl-Heisenberg
coefficient computation

Problem: During the last four years powerful new methods have been introduced for analyzing
Wigner transforms of discrete and periodic signals based on finite Weyl-Heisenberg (W-H)
expansions. A recent work adapted these methods to gain control over the cross-term
interference problem by constructing signal systems in time frequency space for expanding
Wigner transforms from W-H systems based on Gaussian-like signals. The computational
feasibility of the method depends strongly on the availability of efficient and stable
algorithms for computing W-H expansion coefficients.

Approach: The finite Zak transform is established as a fundamental and powerful tool for studying
critically sampled and rationally oversampled W-H systems and for designing algorithms
for computing W-H coefficients for discrete and periodic signals. The role of the finite
Zak transform is analogous to that played by the Fourier transform in replacing complex
convolution computations by simple pointwise multiplication. In this new setting, properties
of W-H systems such as their spanning space and dimension can be determined by
simple operations on functions in Zak space. This relationship will bear on questions of
existence, parameterization and computation of W-H expansions.
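
To make the analogy with the Fourier transform concrete, here is a minimal sketch of a finite Zak transform for a signal of length N = L*M. Definitions and normalizations of the Zak transform vary between authors, so the convention below is an assumption, as are the function and variable names. The check shows that a translation by L samples becomes a pointwise phase multiplication in Zak space.

```python
import cmath

def zak(f, L, M):
    """Finite Zak transform Z[j][k] = sum_r f[j + r*L] * exp(-2*pi*i*r*k/M)."""
    assert len(f) == L * M
    return [
        [sum(f[j + r * L] * cmath.exp(-2j * cmath.pi * r * k / M) for r in range(M))
         for k in range(M)]
        for j in range(L)
    ]

L, M = 3, 4
N = L * M
f = [complex(n % 5, (2 * n) % 3) for n in range(N)]
g = [f[(n + L) % N] for n in range(N)]          # f translated by L samples
Zf, Zg = zak(f, L, M), zak(g, L, M)
# Translation by L acts as multiplication by exp(2*pi*i*k/M) in Zak space.
err = max(abs(Zg[j][k] - cmath.exp(2j * cmath.pi * k / M) * Zf[j][k])
          for j in range(L) for k in range(M))
```

In this convention the transform is a set of L independent M-point DFTs, which is one reason it maps well onto distributed memory machines.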

Results: Implementation results on a single i860 RISC processor and on the PARAGON parallel
multiprocessor system are given for sample sizes that are powers of 2 and for mixed sizes
with factors 2, 3, 4, 5, 6, 7, 8, 9. The algorithms described in this paper possess a highly
parallel structure and are especially suited to a distributed memory parallel processing
environment. Timing results on a single i860 processor and on 4- and 8-node computing
systems show that real-time computation of W-H expansions is realizable.
2.9 Porting parallel multi-dimensional FFT codes to the IBM SP2 shared-memory
multiprocessor

Problem: Recent advances in hardware provide vast possibilities in machine variations. Expensive
and time-consuming efforts in software development are often required for effective
utilization of these advances. In particular, a framework for designing algorithms that takes
architectural variations into account becomes most urgent.

Approach: Tensor product formalism and finite abelian group theory have been the major tools
for our algorithm design and implementation. Although our codes have been optimized
for the Touchstone systems, the flexibility of our design tools allowed us to re-use the
software and algorithmic skeletons and simply recompile and relink them with the
machine-specific interprocessor communication and one-dimensional FFT libraries. The availability
of efficient one- and two-dimensional FFT codes from the ESSL/6000 Engineering and
Scientific Subroutine Library, including both power-of-2 and non-power-of-2 sizes, allowed
us to design general purpose parallel 2D and 3D FFT codes that can handle a wide range
of sizes. We are currently in the process of porting more codes to the IBM SP2, including
RTA and Vector-Radix based FFT algorithms.

Results:  The  parallel  FFT  codes  have  been  successfully  ported  to  the  IBM  SP2  multiprocessor 
system  of  the  NAS  NASA  Research  Center  in  less  than  a  day.  The  NAS  SP2  machine  has 
160  nodes,  each  having  at  least  128  Mbytes  of  main  memory  and  2  Gbytes  of  disk  space. 
The  SP2  nodes  are  based  on  the  RS6000/590  workstation  configuration  that  relies  on  the 
POWER2  multi-chip  RISC  processor  equipped  with  two  integer  and  two  floating  point 
computation  units  capable  of  achieving  a  peak  performance  in  the  order  of  250  MFlops. 
2.10 Parallel DFT codes for Clusters of Workstations

Problem: Clusters Of Workstations (COWs) are becoming very attractive as easily available
alternatives to expensive parallel supercomputers for certain classes of problems. Due to the
special nature of this form of parallel machine (workstations connected via a common
Ethernet cable), row-column methods that require a global transposition step, which implies
all-to-all communication, are highly inefficient. On the other hand, the RTA variants, which
require no inter-processor communication at the expense of preprocessing the data, emerge
as the only viable approach.

Approach: The variants of the RTA decompose the task of DFT computation into a number of
subtasks that can be computed independently at the expense of preprocessing the data. The
broadcasting of the data to all available processors can be implemented very efficiently on
the Ethernet bus topology, since all processors have access to the broadcasting medium.
The parallel RTA implementation on a cluster of workstations is described
in detail in Appendix C.

Goals: Develop a set of parallel DFT codes for clusters of workstations that can be used when
data sizes are large and computational speed is important. Investigate the efficiency and
scalability of the codes and improve the loading/unloading of data/results. Investigate
the tradeoffs between the granularity of the partitioning into subtasks and the amount of
data pre-processing in order to choose the most efficient RTA variant for the particular
implementation. Experiment with large clusters of workstations (100-200) and develop methods for
the computation of Giga-size DFTs.

Applications: Developing efficient codes for clusters of workstations will allow the processing and analysis
of data sets much larger than is possible with the computers available today, advancing research
and understanding in a variety of applications in biomedical engineering, image processing,
systems identification, etc.
3 Implementation Results

3.1 Data partition and migration schemes

Data partition and migration for efficient communication in distributed memory architectures
are critical for the performance of data parallel programs. We have developed a formal methodology
for the process of data distribution and redistribution in terms of tensor products and stride
permutations. The algebraic expressions representing data partition and migration operate directly
on a data vector, and hence can be conveniently embedded into an algorithm. It is also shown
that these expressions are useful for a clear understanding of, and for efficient embedding into,
problems that involve different data distributions at different phases. A unique data distribution
technique that effectively uses transpose algorithms for the multiplication of two rectangular
matrices is derived. The performance of these algorithms is evaluated by carrying out implementations
on Intel's i860-based iPSC/860, Touchstone Delta and Paragon supercomputers.
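
A stride permutation acting directly on a data vector can be sketched in a few lines of Python. The function names and the simulated "nodes" (plain list slices) are assumptions for illustration; the sketch only shows how a permutation followed by a contiguous split turns a block distribution into a cyclic one.

```python
def stride_permutation(v, s):
    """Gather the elements of v at stride s.

    Equivalent to transposing v viewed as a row-major (len(v)/s) x s matrix.
    """
    n = len(v)
    assert n % s == 0
    return [v[i] for r in range(s) for i in range(r, n, s)]

def block_partition(v, p):
    """Split v into p contiguous blocks, one per (simulated) node."""
    m = len(v) // p
    return [v[q * m:(q + 1) * m] for q in range(p)]

v = list(range(8))
block = block_partition(v, 2)                           # block distribution
cyclic = block_partition(stride_permutation(v, 2), 2)   # cyclic distribution
restored = stride_permutation(stride_permutation(v, 2), 4)
```

Applying the stride-2 permutation and then the stride-4 permutation restores the original order of the 8-element vector, mirroring a redistribution followed by its inverse.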
3.1.1 Matrix transpose algorithms in three data distributions

Results of transpose algorithms on Paragon (times in ms)

   M      N    Row-Division   Col-Division   Mesh-Division
 128    128        5.236          6.172           1.316
 128    256        5.902          7.051           2.028
 128    512        9.031         10.409           2.159
 128   1024       12.356         15.312           3.866
 256    128        5.501          6.665           1.825
 256    256        8.283          9.746           2.301
 256    512       11.483         14.027           4.018
 256   1024       20.076         22.503           7.548
 512    128        8.310          9.432           3.450
 512    256       11.555         13.359           5.905
 512    512       18.536         21.122           7.954
 512   1024       39.628         38.529          16.434
1024    128       11.228         13.132           5.815
1024    256       17.526         20.616          10.631
1024    512       31.211         37.445          20.889
1024   1024       50.936         66.403          49.274
Results of transpose algorithms on Touchstone Delta (times in ms)

   N    Row-Division   Col-Division   Mesh-Division
 128        8.092          8.865           2.681
 256       10.042         12.280           5.769
 512       13.988         18.980          11.702
1024       23.909         33.014          20.018

 128       10.065         12.016           5.041
 256       14.228         18.150          11.554
 512       23.030         31.237          20.088
1024       43.458         59.109          36.009

 128       13.982         17.920           9.822
 256       23.002         30.593          19.637
 512       44.178         57.799          36.091
1024       95.145        114.215          79.681

 128       22.743         30.400          19.507
 256       42.197         57.171          36.109
 512       83.011        113.416          79.492
1024      187.484        223.287         167.497
3.1.2 Switching data partition schemes in application

We consider the implementation of a numerical solution to the Euler partial differential equation
(PDE) for the two-dimensional case using the wavelet-Galerkin method. The two most important
computation modules in this solution require two different data partitions for their optimal
implementation. The first module, Helmholtz, involves two-dimensional filtering with forward and inverse
Fourier transform methods. The second module computes the Jacobian, which consists of numerous
small intra-node matrix multiplications. The Jacobian module requires boundary data from
other nodes, but apart from the need for neighboring spatial regions to exchange data, any choice of
data partitioning shows ideal concurrency, with no sequential dependence of one processor's
calculation on another's. Departure from ideal speedup in the evolution of the Jacobian arises because
the elements on node boundaries must be shared by geometrically neighboring processors.
Minimizing the number of elements on the boundaries minimizes the internode communication, leading to
the optimal parallel implementation.

Optimal implementation of Helmholtz requires the data distribution along rows or columns of
the data array, while Jacobian requires the data distribution in 2-dimensional subarrays
(mesh-division). Switching between row-division and mesh-division data partitions is required to make
use of the peak performances of these modules individually.
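
The reason mesh-division suits the Jacobian module can be seen by counting boundary elements. The sketch below is a rough model under stated assumptions: an N x N array, a boundary one element wide, an interior node with neighbours on all sides, and corner elements ignored. The function names are hypothetical.

```python
import math

def boundary_row_division(N, p):
    """Boundary elements of an interior node when N x N data is cut into p row strips."""
    assert N % p == 0
    return 2 * N                  # one full row facing each of two neighbours

def boundary_mesh_division(N, p):
    """Boundary elements of an interior node on a sqrt(p) x sqrt(p) mesh of blocks."""
    q = math.isqrt(p)
    assert q * q == p and N % q == 0
    return 4 * (N // q)           # four edges of an (N/q) x (N/q) block

counts = {p: (boundary_row_division(128, p), boundary_mesh_division(128, p))
          for p in (4, 16, 64)}
```

In this model, for N = 128 the row-division count stays at 256 elements per node regardless of the node count, while the mesh-division count falls from 256 at 4 nodes to 64 at 64 nodes.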
Two-dimensional double-precision complex FFT implementation results

Columns: (1) Intel: iPSC/860 library code, (2) Interface: interface routines appended at input
and output, (3) Algorithm-1, (4) Algorithm-2. All times are in ms.

Problem Size   Nodes      Intel    Interface   Algorithm-1   Algorithm-2
32 x 32            4    0.06054      0.13752       0.12409       0.08476
                  16    0.12427      0.25118       0.20137       0.13195
64 x 64            4    0.15091      0.31761       0.28070       0.23038
                  16    0.13451      0.26424       0.23804       0.17571
                  64    0.48014      0.72160       0.53387       0.39570
128 x 128          4    0.50754      0.96545       0.86153       0.76560
                  16    0.24929      0.44145       0.42941       0.33604
                  64    0.49421      0.76185       0.58775       0.43177
256 x 256          4    1.94816      3.43353       3.17574       2.91836
                  16    0.60610      1.13566       1.15002       1.00119
                  64    0.57530      0.94583       0.82859       0.64410
                 256    1.96009      2.73886       1.66710       1.54402
512 x 512          4    8.58407     14.55625      13.08064      12.30499
                  16    2.37530      4.07935       4.16807       3.81806
                  64    1.09181      2.17609       1.92430       1.63670
                 256    2.54740      2.90163       2.29605       1.96358
Timing results for 128 x 128 size vorticity computations

           Jacobian              Helmholtz                      Total
Nodes   row-D     Mesh      row-D     Mesh1     Mesh2      row-D    Mesh1    Mesh2
  -     2.8317    2.7939    0.11216   0.18218   0.16298    2.9438   2.9761   2.9568
  -       -       0.7310    0.06094   0.09950   0.07688    0.8738   0.8305   0.8079
  -       -       0.1996    0.10510   0.12022   0.08916    0.4146   0.3198   0.2887
3.1.3 Parallel matrix multiplication algorithm for rectangular arrays

Many applications have numerical solutions in which the required computation is presented as matrix
operations. One of the most elementary operations involving matrices is the multiplication of two
matrices. However, since matrix multiplication requires substantially more data movement than
most other operations, algorithms that address efficient data movement are crucial for effective
implementation on concurrent computers.

We have reviewed and implemented an existing matrix multiplication algorithm that generates
and accumulates partial results by moving multiplicands through a set of broadcasts and
shifts. The two extreme cases of its data decomposition strategies involve either only a set of
broadcasts or only a set of shifts. We designed a different approach that replaces broadcasts or
shifts by a matrix transpose. The motivation for this new approach is the identification of
shortcomings in the two extreme cases of the broadcast-and-shift algorithm and the fact that the
dot product of two vectors results in a single element. Then, to overcome the hurdles in memory
requirements, we modified the algorithm for efficient data manipulation with the aid of a block
transpose algorithm. We present an evaluation of communication costs in the broadcast-and-shift
algorithm versus the new approach, and timing results of their implementations on Intel's Paragon,
Touchstone Delta and iPSC/860.

3.1.4  Implementation  results  on  matrix  multiplication  algorithm 


Timing results for routing scheme in new method

 N1   N2   N3    2-nodes   4-nodes   8-nodes   16-nodes
 32  512   32      0.495     1.049     2.294      4.870
 64  512   64      0.801     1.827     3.348      4.970
128  512  128      2.238     4.375     5.775      8.953
256  512  256      7.107    12.377    16.724     22.357
512  512  512     27.340    44.108    57.234     67.113
Timing results for routing schemes in matrix multiplication algorithms on 16 node Paragon


N1  N2  N3  B-S Algor.  New App.  Performance


128  32 
128  64 

128  128 
128  32 

128  32 
128  32 
256  32 

256  64 

256  128 

256  32 

256  64 

256  32 

256  64 

256  32 

512  32 

512  64 

512  128 
512  256 

512  32 

512  64 

512  128 
512  32 

512  64 

512  32 

512  64 


11.811 

9.769 

10.313 

12.108 

15.429 

22.604 

11.185 

11.753 

12.853 

14.466 

14.993 
20.114 
20.618 
37.661 
15.005 
16.127 
18.651 
22.296 
20.647 
21.557 
24.351 
32.073 
32.874 
66.446 

54.994 


Improvement 

119.35 


13.469 


13.529 

13.518 

5.273 


13.496 


9.360 

13.468 

9.333 

13.487 

13.524 

23.743 


65.37 


108.88 


115.33 


178.59 

184.58 

114.71 


172.70 

130.31 
80.81 

243.65 

143.74 

391.31 
131.63 


Timing results for routing schemes in matrix multiplication algorithms on 16 node iPSC/860

N1 N2 N3     B-S Algor.    New App.    Performance Improvement
128 128         26.504      18.586          42.60
256 128         44.936      30.932          45.27
512 128         80.056      55.328          44.69
128 256         47.235      19.506         142.15
128 256         53.787      30.849            -
128 256         65.542      55.086            -
256 256         82.382      30.829         167.22
256 256         89.350      53.987            -
512 256        152.873      55.152         177.18
512 256        159.157     102.409            -
128 512           -         18.271         385.54
128 512        101.255      31.236         224.16
128 512        124.579      55.067         126.23
128 512        185.685     101.661            -
128 512        304.817     198.436            -
256 512        159.439      30.950         415.15
256 512        171.299      55.532         208.47
256 512        197.331     101.436            -
256 512        245.848     198.612          23.78
512 512        300.573      53.400         462.87
512 512        312.586     102.097         206.17
512 512        339.101     199.487          69.97
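
The Performance Improvement figures printed for the iPSC/860 appear consistent with the relative time saving of the new approach over the broadcast-and-shift (B-S) algorithm, expressed as a percentage of the new approach's time. This reading is inferred from the printed numbers rather than stated in the report, and the function name below is hypothetical.

```python
def improvement_percent(t_bs, t_new):
    """Relative improvement of the new approach over B-S, in percent (assumed definition)."""
    return 100.0 * (t_bs - t_new) / t_new

# First two printed rows of the iPSC/860 results: (B-S time, new-approach time).
first = improvement_percent(26.504, 18.586)
second = improvement_percent(44.936, 30.932)
```

Rounded to two decimals these reproduce the printed values 42.60 and 45.27.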


Paragon  nodes  (50  MHz  i860) 

The corresponding timings for the Kuck & Associates power-of-two FFT routines are given for
comparison purposes. For both codes, the timings have been performed using the same method.
Complex, single precision 2D FFTs on Paragon

The routines are coded in Fortran calling upon small size DFT assembly modules. The Cooley-Tukey
algorithm is used.

size       time (ms)     size       time (ms)
40 x 40      2.138       40 x 36      1.753
40 x 35      1.815       40 x 28      1.213
40 x 25      0.930       40 x 24      0.832
40 x 20      0.651       40 x 15      0.516
40 x 12      0.371       36 x 40      1.799
36 x 36      1.493       36 x 35      1.529
36 x 28      0.968       36 x 25      0.799
36 x 24      0.713       36 x 20      0.574
36 x 12      0.328       32 x 32      1.114
20 x 40      0.651       20 x 36      0.594
20 x 35      0.595       20 x 32      0.480
20 x 28      0.440       20 x 25      0.382
20 x 24      0.379       20 x 20      0.311
20 x 16      0.224       20 x 15      0.249
20 x 12      0.187       15 x 40      0.507
15 x 36      0.452       15 x 35      0.465
15 x 32      0.382       15 x 28      0.351
15 x 25      0.335       15 x 24      0.303
15 x 20      0.247       15 x 16      0.188
15 x 15      0.197       15 x 12      0.143
12 x 40      0.371       12 x 36      0.329
12 x 35      0.344       12 x 32      0.280
12 x 28      0.267       12 x 25      0.248
12 x 24      0.224       12 x 20      0.188
12 x 16      0.130       12 x 15      0.145
12 x 12      0.111
Complex, single precision 2D RTA FFTs on Paragon

The routines are coded in Fortran calling upon small size DFT assembly modules. The Hybrid
RTA (HRTA) algorithm is used. Results are given for sizes p2^m x p2^n, where p is a prime
number (3, 5 or 7). time1 refers to the computational time when the data are already
permuted (CRT has been pre-applied) and the output is obtained on the algebraic lines.
time2 refers to the time required when the input is in its natural order (column-wise) and so is
the output.

size          time1 (ms)   time2 (ms)
24 x 24           2.54         3.47
48 x 48           8.41        22.27
96 x 48          16.07        52.18
96 x 96          31.99       100.10
192 x 96         63.72       205.68
192 x 192       127.75       429.32
384 x 192       262.48       895.22
384 x 384       600.77      1882.31
768 x 384      1206.76      3823.15
768 x 768      2672.44      7903.65

20 x 20           2.52         5.13
40 x 20           3.89         9.31
40 x 40           6.26        17.23
80 x 40          11.71        34.03
80 x 80          22.62        66.52
160 x 80         45.31       134.27
160 x 160        90.51       278.01
320 x 160       180.81       576.84
320 x 320       361.13      1181.26
640 x 320       773.75      2500.96
640 x 640      1741.80      5217.32

28 x 28           4.52         9.67
56 x 28           7.37        17.64
56 x 56          12.57        33.39
112 x 56         24.05        65.64
112 x 112        46.06       131.22
224 x 112        91.57       265.79
224 x 224       179.97       549.69
448 x 224       357.37      1140.63
448 x 448       729.56      2339.62
896 x 448      1584.03      4904.57
3.3 Multi-processor codes

3.3.1 Complex, 1D single precision FFTs

The Cooley-Tukey formulation is used. The 1-D FFT computation is formulated as
a 2-D FFT with an intermediate twiddle factor multiplication. Three global transpositions are
required for the in-place 1-D FFT, and two if the distribution of the results is not required
to coincide with that of the data. Furthermore, only one global transposition is required
if the initial distribution of the data is assumed to be in a strided (transposed) fashion. In the
following table, all times are in sec x 10”^, time1 refers to the in-place version and time2 to the
out-of-place version.
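
The formulation of a 1-D FFT as a 2-D FFT with intermediate twiddle factors can be sketched as follows. This is a serial illustration under stated assumptions: the naive DFT stands in for a fast routine, the function names are hypothetical, and no data distribution is modelled.

```python
import cmath

def dft(v):
    """Naive 1D DFT (stand-in for a fast routine)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft_via_2d(x, N1, N2):
    """Length N1*N2 DFT computed as column DFTs, twiddles, then row DFTs."""
    N = N1 * N2
    assert len(x) == N
    # View x as an N1 x N2 array: a[n1][n2] = x[N2*n1 + n2].
    a = [x[N2 * n1:N2 * (n1 + 1)] for n1 in range(N1)]
    # Step 1: length-N1 DFTs down the columns, giving colt[n2][k1].
    colt = [dft([a[n1][n2] for n1 in range(N1)]) for n2 in range(N2)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*k1*n2/N).
    b = [[colt[n2][k1] * cmath.exp(-2j * cmath.pi * k1 * n2 / N) for n2 in range(N2)]
         for k1 in range(N1)]
    # Step 3: length-N2 DFTs along the rows, giving c[k1][k2].
    c = [dft(row) for row in b]
    # Step 4: reorder, since output index k = k1 + N1*k2.
    X = [0j] * N
    for k1 in range(N1):
        for k2 in range(N2):
            X[k1 + N1 * k2] = c[k1][k2]
    return X

x = [complex(n % 4, (3 * n) % 5) for n in range(12)]
reference = dft(x)
result = fft_via_2d(x, 4, 3)
```

In a distributed setting, the column access in step 1 and the reordering in step 4 are the points where global data movement would be needed, which is consistent with the transposition counts discussed above.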
3.3.2 Complex, 2D single precision FFTs

The hypercube transpose algorithm is used for both implementations. In the Intel code, the
data is assumed to be distributed row-wise (C convention), and in the Aware codes the
data are distributed column-wise (Fortran convention). Two sets of timings are reported.
For the first set of timings (time1), two global data transpositions are required so that the final
distribution of the results is the same as the original data distribution. In the second (time2),
the second global data transposition is omitted.

The Intel code, originally designed for the iPSC/860 hypercube, uses synchronous
communication calls (csend), whereas the Aware code uses asynchronous communication calls (isend).
The Aware code breaks the global transposition stage into two partial global transpositions
that are performed concurrently with one-dimensional FFTs on the nodes.
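
The difference between the time1 and time2 cases can be illustrated with a serial simulation of a row-distributed 2D FFT. The sketch makes several assumptions: nodes are simulated as list slices, the global transposition is done by gathering the rows rather than by message passing, the overlap of communication with computation is not modelled, and all names are hypothetical.

```python
import cmath

def dft(v):
    """Naive 1D DFT (stand-in for the node-level FFT routine)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def local_ffts(blocks):
    """Each simulated node transforms the rows it holds."""
    return [[dft(row) for row in blk] for blk in blocks]

def global_transpose(blocks):
    """Simulated all-to-all exchange: redistribute the transposed array by rows."""
    rows = [row for blk in blocks for row in blk]
    t = [list(col) for col in zip(*rows)]
    p = len(blocks)
    r = len(t) // p
    return [t[q * r:(q + 1) * r] for q in range(p)]

def fft2_distributed(a, p, restore=True):
    r = len(a) // p
    blocks = [a[q * r:(q + 1) * r] for q in range(p)]
    blocks = local_ffts(blocks)            # transform along the first local direction
    blocks = global_transpose(blocks)      # first global transposition
    blocks = local_ffts(blocks)            # transform along the other direction
    if restore:
        blocks = global_transpose(blocks)  # second transposition (time1 case)
    return [row for blk in blocks for row in blk]

a = [[complex(i + 2 * j, i * j) for j in range(4)] for i in range(4)]
in_place = fft2_distributed(a, 2, restore=True)
transposed = fft2_distributed(a, 2, restore=False)
```

With `restore=False` the result is left in transposed order, which saves one global exchange and corresponds to the cheaper of the two timing cases.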
Timing results on the Caltech Delta

time1  = in place, ms, Intel      time2  = transposed, ms, Intel
time1a = in place, ms, Aware      time2a = transposed, ms, Aware

nodes   size            time1   time2   time1a   time2a
  2     1024 x 512       2205    1706     1555     1041
        512 x 512         962     737      764      507
        512 x 256         433     332      360      231
        256 x 256         196     145      158      108
        256 x 128          99      72       85       56
        128 x 128          51      36       40       31
  4     1024 x 1024      2227    1647     1760     1154
        1024 x 512       1014     750      878      572
        512 x 512         472     344      424      275
        512 x 256         230     165      194      126
        256 x 256         113      80       96       74
        256 x 128          61      44       55       36
  8     2048 x 1024      2464    1869     1955     1282
        1024 x 1024      1040     749      966      680
        1024 x 512        514     364      481      328
        512 x 512         247     173      226      147
        512 x 256         131      90      118       75
        256 x 256          67      46       41       43
 16     2048 x 2048      2728    2078     2292     1491
        2048 x 1024      1212     893     1184      719
        1024 x 1024       533     370      520      333
        1024 x 512        273     186      255      184
        512 x 512         133      90      134       82
        512 x 256          73      50       89       53
 32     4096 x 2048      3406    2278     3116     1887
        2048 x 2048      1656    1099     1419      859
        2048 x 1024       702     490      694      414
        1024 x 1024       413     208      313      188
        1024 x 512        168     108      183      108
        512 x 512          89      60      129       69
 64     4096 x 4096      3856    2669     3741     2213
        4096 x 2048      1753    1240     1770     1059
        2048 x 2048      1018     599      790      496
        2048 x 1024       440     272      475      242
        1024 x 1024       362     127      221      103
        1024 x 512        135     140      165        -
The hypercube transpose algorithm is used. The data is assumed to be distributed column-wise
(Fortran convention). For the first set of timings (time1), two global data transpositions are
required so that the final distribution of the results is the same as the original data distribution.
In the second (time2), the second global data transposition is omitted. The code uses
synchronous communication calls and is based on the original example provided in the iPSC/860
manuals. For the non-power-of-two 1D FFTs, in-house codes (libdft.a) are employed. The
timings are reported for only a limited number of cases. Other DFT sizes, as well as mixed
DFT-FFT cases, can be treated as well.

Timings on the Paragon (non-power-of-2 sizes)

time1 = in place, ms, Aware      time2 = transposed, ms, Aware

nodes   size            time1   time2
  2     224 x 224          83      64
        300 x 300         165     133
        360 x 360         209     164
        448 x 448         335     264
        576 x 576         562     445
        640 x 640         759     616
        800 x 800        1134     908
        900 x 900        1530    1234
        1280 x 1280      3239    2657
        1440 x 1440      3979    3238
  4     224 x 224          51      38
        300 x 300          98      75
        360 x 360         121      91
        448 x 448         188     143
        576 x 576         309     238
        640 x 640         410     327
        800 x 800         610     478
        900 x 900         831     654
        1280 x 1280      1730    1379
        1440 x 1440      2107    1685
        1600 x 1600      2670    2145
        1792 x 1792      3291    2642
  8     224 x 224          36      24
        360 x 360          77      55
        448 x 448         110      81
        576 x 576         174     132
        640 x 640         226     175
        800 x 800         333     256
        1280 x 1280       901     715
        1440 x 1440      1104     870
        1600 x 1600      1392    1105
        1792 x 1792      1711    1380
 16     224 x 224          32      21
        448 x 448          74      51
        576 x 576         111      78
        640 x 640         140     103
        800 x 800         199     147
        1280 x 1280       486     379
        1440 x 1440       599     463
        1600 x 1600       743     581
        1792 x 1792       908     711


3.3.3  Real-to-Hermitian,  2D  single  precision  FFTs 

The  hypercube  transpose  algorithm  is  used.  The  data  is  assumed  to  be  distributed  column-wise 
(Fortran  convention).  The  timings  do  not  include  the  final  transposition  stage,  so  that  the 
results  are  obtained  distributed  along  the  first  dimension. 

The code uses asynchronous communication calls to interleave computation and communication. Each node partitions its local data into two subsets and performs 1D FFTs on one subset while transposing the other.

The hypercube transpose algorithm is used. The data is assumed to be distributed column-wise (Fortran convention). The timings do not include the final transposition stage, so the results are obtained distributed along the first dimension.

The code uses synchronous communication calls.

Timings  on  the  Paragon  (power  of  2  sizes):  version  il 


time = transposed, ms, Aware

nodes  size           time
2      128 x 128        10
2      128 x 256        20
2      256 x 256        39
2      256 x 512        80
2      512 x 512       163
2      512 x 1024      378
2      1024 x 1024     818
2      1024 x 2048    1696
2      2048 x 2048    3576
4      128 x 128         7
4      128 x 256        12
4      256 x 256        23
4      256 x 512        46
4      512 x 512        89
4      512 x 1024      197
4      1024 x 1024     428
4      1024 x 2048     887
4      2048 x 2048    1855
4      2048 x 4096    3351
8      128 x 128         7
8      128 x 256        10
8      256 x 256        16
8      256 x 512        28
8      512 x 512        51
8      512 x 1024      108
8      1024 x 1024     225
8      1024 x 2048     443
8      2048 x 2048     959
8      2048 x 4096    1728
8      4096 x 4096    3929
16     128 x 128        11
16     128 x 256        13
16     256 x 256        15
16     256 x 512        23
16     512 x 512        34
16     512 x 1024       64
16     1024 x 1024     126
16     1024 x 2048     239
16     2048 x 2048     479
16     2048 x 4096     850
16     4096 x 4096    1997
32     128 x 128        21
32     128 x 256        21
32     256 x 256        23
32     256 x 512        26
32     512 x 512        32
32     512 x 1024       50
32     1024 x 1024      80
32     1024 x 2048     142
32     2048 x 2048     271
32     2048 x 4096     463
32     4096 x 4096     997

3.3.4  Complex-to-Complex, 3D FFT

The hypercube transpose algorithm is used. The data is assumed to be distributed along the last dimension. For the first set of timings (time1), two global data transpositions are required so that the final distribution of the results is the same as the original data distribution. In the second (time2), the second global data transposition is omitted. The code uses synchronous communication calls and is based on the original example provided in the iPSC/860 manuals.
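
The difference between the two timing variants can be illustrated with a small single-process simulation. The sketch below is only an assumption-laden model, not the report's code: the nodes are plain Python dictionaries, the naive `dft` stands in for an optimized node FFT routine, and the helper names (`distribute`, `global_transpose`, `fft3d`) are hypothetical. With `in_place=True` both global transpositions are performed (time1); with `in_place=False` the second one is skipped and the result stays distributed along the first dimension (time2).

```python
import cmath

def dft(v):
    """Naive O(n^2) 1D DFT; stands in for an optimized node FFT routine."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transform_axis(local, axis, n):
    """Apply a length-n DFT along `axis` to every line held by one node."""
    lines = {idx[:axis] + idx[axis + 1:] for idx in local}
    for key in lines:
        ids = [key[:axis] + (t,) + key[axis:] for t in range(n)]
        for i, val in zip(ids, dft([local[i] for i in ids])):
            local[i] = val

def distribute(data, axis, block, p):
    """Node r receives the entries whose index along `axis` lies in block r."""
    return [{i: v for i, v in data.items() if i[axis] // block == r}
            for r in range(p)]

def global_transpose(nodes, axis, block):
    """Model of a global data transposition: redistribute along `axis`."""
    merged = {}
    for local in nodes:
        merged.update(local)
    return distribute(merged, axis, block, len(nodes))

def fft3d(x, n, p, in_place=True):
    """3D DFT of an n x n x n array x = {(i, j, k): value} on p simulated nodes."""
    block = n // p
    nodes = distribute(x, 2, block, p)         # data distributed along last dimension
    for local in nodes:                        # transforms along the two local dimensions
        transform_axis(local, 0, n)
        transform_axis(local, 1, n)
    nodes = global_transpose(nodes, 0, block)  # first global transposition
    for local in nodes:                        # last dimension is now local
        transform_axis(local, 2, n)
    if in_place:                               # second global transposition (time1 only)
        nodes = global_transpose(nodes, 2, block)
    return nodes
```

Calling `fft3d(x, 4, 2, in_place=False)` returns the same transform values as the in-place variant, but each simulated node then owns a block of the first index rather than of the last.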

Timing  results  on  the  Delta  (power  of  2  sizes  -  single  precision) 

time1 = in place, ms, Aware; time2 = transposed, ms, Aware

nodes  size              time1  time2
2      32 x 32 x 32        122     99
2      64 x 32 x 32        230    187
2      64 x 64 x 32        442    358
2      64 x 64 x 64        843    682
2      128 x 64 x 64      1653   1335
4      32 x 32 x 32         70     54
4      64 x 32 x 32        133    103
4      64 x 64 x 32        254    196
4      64 x 64 x 64        489    373
4      128 x 64 x 64       959    732
4      128 x 128 x 64     1893   1441
8      32 x 32 x 32         43     30
8      64 x 32 x 32         76     56
8      64 x 64 x 32        143    105
8      64 x 64 x 64        345    198
8      128 x 64 x 64       550    386
8      128 x 128 x 64     1047    759
8      128 x 128 x 128    2021   1590
16     32 x 32 x 32         28     16
16     64 x 32 x 32         41     25
16     64 x 64 x 32         61     41
16     64 x 64 x 64        101     72
16     128 x 64 x 64       184    136
16     128 x 128 x 64      347    263
16     128 x 128 x 128     669    517
16     256 x 128 x 128    1315   1037

The hypercube transpose algorithm is used. The data is assumed to be distributed along the last dimension. For the first set of timing results (time1), two global data transpositions are required so that the final distribution of the results is the same as the original data distribution. In the second (time2), the second global data transposition is omitted. The code uses synchronous communication calls and is based on the original example provided in the iPSC/860 manuals. The local data permutations are performed using Kuck & Associates library matrix transposition calls.

Timings  on  the  Paragon  (power  of  2  sizes-single  precision) 


time1 = in place, ms, Aware; time2 = transposed, ms, Aware

nodes  size              time1  time2
2      32 x 32 x 32         71     58
2      64 x 32 x 32        139    113
2      64 x 64 x 32        279    230
2      64 x 64 x 64        561    461
2      128 x 64 x 64      1114    919
4      32 x 32 x 32         47     36
4      64 x 32 x 32         83     65
4      64 x 64 x 32        154    124
4      64 x 64 x 64        300    242
4      128 x 64 x 64       583    474
4      128 x 128 x 64     1165    953
8      32 x 32 x 32         29     20
8      64 x 32 x 32         49     35
8      64 x 64 x 32         91     68
8      64 x 64 x 64        171    131
8      128 x 64 x 64       325    259
8      128 x 128 x 64      644    509
8      128 x 128 x 128    1282   1023
16     32 x 32 x 32         27     16
16     64 x 32 x 32         40     25
16     64 x 64 x 32         61     41
16     64 x 64 x 64        103     73
16     128 x 64 x 64       188    139
16     128 x 128 x 64      353    269
16     128 x 128 x 128     684    531
16     256 x 128 x 128    1343   1063
32     32 x 32 x 32         43     23
32     64 x 32 x 32         47     26
32     64 x 64 x 32         56     33
32     64 x 64 x 64         82     52
32     128 x 64 x 64       124     84
32     128 x 128 x 64      215    155
32     128 x 128 x 128     398    294
32     256 x 128 x 128     740    567
32     256 x 256 x 256    1427   1114

The hypercube transpose algorithm is used. The data is assumed to be distributed along the last dimension. For the first set of timing results (time1), two global data transpositions are required so that the final distribution of the results is the same as the original data distribution. In the second (time2), the second global data transposition is omitted. The code uses synchronous communication calls and is based on the original example provided in the iPSC/860 manuals. The local data permutations are performed using Kuck & Associates library matrix transposition calls. Since the library contains only double precision real transpositions, the double precision complex transpositions have been reformulated in terms of the available library functions.

Timings  on  the  Paragon  (power  of  2  sizes-double  precision) 

time1 = in place, ms, Aware; time2 = transposed, ms, Aware

nodes  size              time1  time2
2      32 x 32 x 32        141    118
2      64 x 32 x 32        274    229
2      64 x 64 x 32        539    451
2      64 x 64 x 64       1069    894
4      32 x 32 x 32         82     66
4      64 x 32 x 32        151    128
4      64 x 64 x 32        291    237
4      64 x 64 x 64        567    466
4      128 x 64 x 64      1126    926
8      32 x 32 x 32         52     39
8      64 x 32 x 32         90     70
8      64 x 64 x 32        164    133
8      64 x 64 x 64        307    246
8      128 x 64 x 64       593    479
8      128 x 128 x 64     1181    962
16     32 x 32 x 32         40     30
16     64 x 32 x 32         59     45
16     64 x 64 x 32         98     76
16     64 x 64 x 64        177    138
16     128 x 64 x 64       326    258
16     128 x 128 x 64      624    501
16     128 x 128 x 128    1228    998
32     32 x 32 x 32         48     27
32     64 x 32 x 32         58     34
32     64 x 64 x 32         82     53
32     64 x 64 x 64        120     83
32     128 x 64 x 64       204    148
32     128 x 128 x 64      373    278
32     128 x 128 x 128     687    532
32     256 x 128 x 128    1511   1240

3.4  Vector Radix (VR) on the Paragon
3.4.1  2D Vector Radix (VR) on the Paragon
Implementation  Results 

We have implemented the parallel algorithms described in the previous section on an Intel Paragon multiprocessor system, which is based on the i860XR microprocessor and employs a mesh interconnection network. Optimized assembly-coded routines for the nodes include 1D FFTs, routines from BLAS, and matrix transposition routines. The RC method can be made very efficient since optimized 1D FFT routines can be used. For the partial VR algorithm, the computation of the p-point FFTs (p is the number of nodes) is performed either via optimized hand-coded assembly routines that perform strided small-sized FFTs with twiddle factor multiplication, or by performing the butterflies explicitly using vectorized complex multiply-accumulate routines from the BLAS library.

In Table 1, we compare the RC and PVR implementations for a variety of test and machine sizes. Although the PVR method has not been fully optimized, it generally performs better than the RC method, with the advantage being more evident for relatively small machine partitions. For more than 16 nodes, the PVR algorithm performs only slightly better than the RC; however, substantial optimization can still be performed.

In Table 3.4.1, we compare the Collect-Distribute (CD) implementation with the Full VR (FVR). In both implementations the 2D data are distributed along both dimensions and the results are obtained in place. Again, as in the case of the RC, the CD method has the advantage of using highly optimized 1D FFT routines, at the expense of increased data movement. As Table 3.4.1 shows, the FVR implementation is more efficient than the CD method, and additional optimization in the computation of the radix p x q FFTs is possible.

m      n      nodes   PVR     RC
256    256    2        83     90
256    512    2       162    190
512    512    2       390    400
512    1024   2       690    918
1024   1024   2      1581   2065
256    512    4        96    109
512    512    4       187    229
512    1024   4       371    495
1024   1024   4       829   1093
1024   2048   4      1729   2282
2048   2048   4      3584   4742
512    512    8       113    123
512    1024   8       210    267
1024   1024   8       449    582
1024   2048   8       900   1186
2048   2048   8      1853   2443
512    512    16       84     66
512    1024   16      140    127
1024   1024   16      254    260
1024   2048   16      484    522
2048   2048   16      973   1110
2048   4096   16     1945   2061
512    512    32       93     71
512    1024   32      119    104
1024   1024   32      189    185
1024   2048   32      318    334
2048   2048   32      542    608
2048   4096   32     1021   1115
4096   4096   32     2036   2087
Comparison of the partial Vector-Radix approach and the Row-Column optimized implementation (execution times are in milliseconds).

m      n      nodes   FVR     CD
256    512    4       119    155
512    512    4       230    301
512    1024   4       464    606
1024   1024   4       951   1237
1024   2048   4      2111   2676
512    512    8       132      -
512    1024   8       249      -
1024   1024   8       487      -
1024   2048   8      1062      -
2048   2048   8      2180
512    512    16       81     99
512    1024   16      151    188
1024   1024   16      275    377
1024   2048   16      546    750
2048   2048   16     1104   1559
2048   4096   16     2402   3106

Table 1: Timings for the Full VR implementation (execution times are in milliseconds)




3.4.2  The  3D  Vector- Radix  Implementation  on  the  Paragon 

Several variations of the 3D VR algorithm have been implemented for a variety of machine sizes. The VR algorithm offers greater flexibility in data and computation flows as well as in the initial and final data distribution. In the timing results reported next, data are assumed to be distributed along the last dimension. Since in the 3D case the length of the 1D FFTs that have to be computed is in general considerably smaller than in the 2D case (assuming that the data must fit into the processors' local memory), efficient vectorized FFT routines have been written. Although these routines are coded in Fortran, when they are used to compute vectorized FFTs of highly rectangular data structures they perform substantially better than the optimized assembly-coded library 1D FFT routines. The greater flexibility that the 3D VR algorithm offers, together with other improvements in inter-processor communication strategies, resulted in codes that are more than twice as fast as the corresponding RC 3D codes, especially for relatively small machine configurations.

Timing results

nodes  size              time1         time2
2      32 x 32 x 32       60 (141)      54 (118)
2      64 x 32 x 32      125 (274)     112 (229)
2      64 x 64 x 32      246 (539)     212 (451)
2      64 x 64 x 64      516 (1069)    458 (894)
4      32 x 32 x 32       30 (82)       27 (66)
4      64 x 32 x 32       61 (151)      50 (128)
4      64 x 64 x 32      117 (291)     102 (237)
4      64 x 64 x 64      239 (567)     200 (466)
4      128 x 64 x 64     475 (1126)    414 (926)
8      32 x 32 x 32       16 (52)       12 (39)
8      64 x 32 x 32       27 (90)       23 (70)
8      64 x 64 x 32       48 (164)      43 (133)
8      64 x 64 x 64      116 (307)      97 (246)
8      128 x 64 x 64     229 (593)     191 (479)
8      128 x 128 x 64    505 (1181)    382 (962)
16     64 x 64 x 64       87 (177)      70 (138)
16     128 x 64 x 64     144 (326)     134 (258)
16     128 x 128 x 64    265 (624)     242 (501)
16     128 x 128 x 128   502 (1228)    462 (998)
32     128 x 64 x 64     111 (204)      89 (148)
32     128 x 128 x 64    191 (373)     158 (278)
32     128 x 128 x 128   318 (687)     286 (532)
32     256 x 128 x 128   586 (1511)    538 (1240)

All timing results are reported in milliseconds. For convenience, the timings of the Row-Column implementation for the corresponding data sizes are given in parentheses.
time1 = in place, ms, Aware; time2 = transposed, ms, Aware

3.5  Implementation results on IBM SP2

time1 refers to the time required to perform a forward 2D FFT in which the node distribution of the output coincides with that of the input. time2 is the corresponding time when the results are obtained in transposed fashion, i.e., on nodes different from those where the data were originally stored. All times are measured using the mclock() function call and are reported in milliseconds.


Size (n x m)   nodes  time1 (ms)  time2 (ms)
1024 x 1024    4         380         280
1024 x 1024    8         190         150
1024 x 1024    16        110          80
1024 x 1024    32         60          40
1024 x 1024    64         40          30
2048 x 1024    8         410         310
2048 x 1024    16        210         160
2048 x 1024    32        130          90
2048 x 1024    64         80          50
2048 x 2048    16        450         310
2048 x 2048    32        240         170
2048 x 2048    64        150         100

Table SP2-1: Time required for forward 2D complex single precision FFT.


From the timings reported in Table SP2-1 we see that each global matrix transposition requires approximately 25% of the total time. In the case of the in-place parallel 2D FFT (i.e., when the node distribution of the results coincides with that of the data), about 50% of the total time is needed for inter-processor communication and local data transpositions. This suggests that substantial improvements could be achieved by using asynchronous communication calls to interleave node computations with data communications.
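
The transposition share quoted above can be checked directly from the table, under the assumption that time1 and time2 differ only by the second global transposition. The following sketch uses a few rows copied from Table SP2-1; the function name `transposition_share` is illustrative only.

```python
# (size, nodes, time1_ms, time2_ms): a few rows copied from Table SP2-1
rows = [
    ("1024 x 1024", 4, 380, 280),
    ("1024 x 1024", 8, 190, 150),
    ("2048 x 1024", 8, 410, 310),
    ("2048 x 2048", 16, 450, 310),
    ("2048 x 2048", 64, 150, 100),
]

def transposition_share(time1, time2):
    """Fraction of the in-place time attributed to the second global transposition,
    assuming time1 - time2 is exactly the cost of that transposition."""
    return (time1 - time2) / time1

shares = [transposition_share(t1, t2) for _, _, t1, t2 in rows]
mean_share = sum(shares) / len(shares)

for (size, nodes, _, _), share in zip(rows, shares):
    print(f"{size:>11} on {nodes:2d} nodes: {100 * share:.0f}% per transposition")
print(f"mean over these rows: {100 * mean_share:.0f}%")
```

For these rows the share of a single transposition falls between roughly one fifth and one third of the in-place time, which is consistent with the two transpositions together accounting for about half of it.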


In Table SP2-2 we report timings for the case of parallel 3D FFTs. Again, time1 refers to the “in-order” case and time2 refers to the “out-of-order” case.

Size (n x m x k)   nodes  time1 (ms)  time2 (ms)
128 x 128 x 64     4         390         270
128 x 128 x 64     8         210         140
128 x 128 x 64     16        130          70
128 x 128 x 64     32         80          40
128 x 128 x 64     64         60          20
128 x 128 x 128    8         420         270
128 x 128 x 128    16        230         150
128 x 128 x 128    32        150          90
128 x 128 x 128    64        100          80

Table SP2-2: Time required for forward 3D complex single precision FFT. As in the case of the 2D parallel FFT, inter-processor communication requires a substantial part of the total FFT time.

3.6  RTA multi-processor codes

For this set of codes, each node is assumed to store the whole data set in its local memory. Each node performs the CRT and the corresponding periodization and then computes the 3D DFT. The timings do not include the final data re-indexing stage.


nodes  size        time (ms)
4      192 x 192      34.79
4      384 x 192      71.35
4      384 x 384     162.25
4      768 x 384     326.89
4      768 x 768     715.19

3.7  Implementation results for Gabor coefficients


Table  1.  Timing  Results  on  i860  Single  Node  (Critical  Sampling  -  2^) 


Sample Size n   2-D L x M    Time (ms = 10^-3 sec)
256             16 x 16          0.67
512             16 x 32          1.20
1024            32 x 32          2.02
2048            32 x 64          3.98
4096            64 x 64          7.41
8192            64 x 128        14.96
16384           128 x 128       29.82
32768           128 x 256       60.89
65536           256 x 256      125.55
131072          256 x 512      264.60
262144          512 x 512      566.99

Table  2.  Timing  on  i860  Single  Node  (Critical  Sampling  -  Mixed  sizes) 


Sample Size n   2-D L x M    Time (ms = 10^-3 sec)
384             8 x 48           1.47
768             16 x 48          1.99
1536            32 x 48          3.12
3072            64 x 48          5.91
3072            128 x 24         6.15
6144            128 x 48        12.07
6144            64 x 96         12.48
12288           512 x 24        26.07
12288           128 x 96        24.05
24576           256 x 96        48.70
49152           256 x 192       98.71
98304           256 x 384      203.52
98304           512 x 192      209.12
196608          512 x 384      433.41
393216          1024 x 384    1011.61

Table  3.  Timing  Results  on  i860  Single  Node  (Integer  Oversampled) 


Sample  Size  n  =  L'M  2-D  L'  x  M  Time  ms 


32  X  16 


32  X  32 


64x  32 


64x  64 


8192 

128  X  64 

16384 

128  X  128 

32768 


65536 

131072 

262144 

524288 


256  X  128 


256  X  256 
512  X  256 
512  X  512 
1024  X  512 


60.89 

125.55 

264.60 

566.99 


Table  4.  Timing  on  i860  Single  Node  (Fractional  Oversampling  (3/2)) 


Sample Size n   2-D L x M    Time (ms = 10^-3 sec)
384             16 x 24          2.06
768             32 x 24          2.97
768             16 x 48          3.91
1536            64 x 24          5.31
1536            32 x 48          6.03
3072            64 x 48         10.79
3072            128 x 24        10.05
6144            128 x 48        20.85
6144            64 x 96         22.86
12288           128 x 96        43.15
24576           256 x 96        84.71
49152           256 x 192      171.39
98304           256 x 384      412.12
98304           512 x 192      413.50
196608          512 x 384      840.02

Table 5.  Timing on i860 Single Node (Fractional Oversampling (5/4))


Sample Size n   2-D L x M    Time (ms = 10^-3 sec)
320             8 x 40           2.82
640             16 x 40          3.85
1280            32 x 40          5.66
1280            16 x 80          7.66
2560            64 x 40          9.65
2560            32 x 80         11.35
5120            128 x 40        16.42
5120            64 x 80         18.32
5120            32 x 160        22.49
10240           128 x 80        32.09
10240           64 x 160        37.99
20480           128 x 160       67.65
20480           64 x 320        74.42
40960           128 x 320      134.08
81920           256 x 320      258.40
81920           128 x 640      276.69
163840          512 x 320      522.19
163840          256 x 640      534.90
327680          512 x 640     1149.76

A  Formulating Data-Partition and Migration in Distributed Memory Multiprocessors

Abstract

This paper presents an algebraic framework for expressing data-partition and migration in distributed memory multiprocessors in terms of the algebra of stride permutations. This algebra provides powerful tools for visualizing the cost of communication in parallel computations and for minimizing this cost by straightforward algebraic manipulations. We demonstrate the significance of this tool and show how it leads to significant performance gains on Intel’s Touchstone systems (Delta, iPSC/860 and Paragon) in three examples: a matrix transpose algorithm, a two-dimensional discrete Fourier transform algorithm, and the solution of Euler partial differential equations using the wavelet-Galerkin method.

A.1  Introduction

It is well known that data distribution in distributed memory multiprocessors is essential to achieving high performance in data-parallel programs. Extensive research has been reported on data-decomposition optimization for distributed memory machines [1, 2, 3, 4, 5]. Research in this area can be crudely classified into two categories. One aims at finding optimal data-partitioning schemes for parallel loop constructs as part of a compiler. It has been shown that the problem of finding an optimal data-partition is NP-complete [3, 6, 1]. Therefore, researchers have to rely on heuristic methods [6, 7, 8, 2, 9]. The other effort aims at special-purpose implementations, and a large body of work on developing optimal implementations of individual algorithms has been reported [10, 11, 12].

Typically, an application requires a number of computation modules linked together to accomplish a specific computation. Global optimization depends not only on optimal implementation of the computational modules, but at least equally on the interface between these implementations as determined by the data partition and migration across processors.

In this paper, we present a systematic formulation for data-partition and migration on distributed memory multiprocessors in terms of tensor product notation and stride permutations. Data-partition and migration are represented using simple tensor algebraic expressions highlighting the computational and communication complexity of parallel algorithms. Therefore, optimal data-partition and migration at interfaces between different algorithms reduces to straightforward tensor algebraic manipulations with the aid of well-established theorems in this field. Furthermore, due to the conciseness of the underlying algebra, definitions are simple and compact without having to deal with complicated indices in complex data structures.

In order to demonstrate the significance and usefulness of our framework, we have carried out experiments on existing distributed memory multiprocessors such as Intel’s Paragon and Touchstone Delta. Initially, our formal definitions are incorporated in three application problems: a matrix transpose algorithm, a two-dimensional discrete Fourier transform algorithm, and the solution of the Euler partial differential equation using the wavelet-Galerkin approach. Then, simple algebraic manipulations on these expressions are carried out to derive optimal data-partition and migration schemes. Experimental timing results on these machines show that such simple algebraic manipulations result in performance improvements ranging from 30% to 600%.

The rest of the paper begins with a simple introduction to tensor notation and stride permutations as background for our work. In Section 3, we present our formal definitions for data-partition and migration in distributed memory multiprocessor systems. Experiments on Intel’s distributed machines and discussions of numerical results are presented in Section 4. Section 5 discusses related work in the field with respect to our model. We conclude the paper in Section 6.

A.2  Preliminaries

In this subsection, we review the notation and terminology that will be used throughout the paper.

A.2.1  Stride Permutations

A vector x is an ordered finite linear array. The dimension of x, denoted by dim(x), is the number of elements in the linear array. Let dim(x) = LS for positive integers L and S. Stride permutations are a natural way of representing data-shuffling operations. We use P(LS, S) to represent the stride permutation operation on a vector of length LS with stride S. To define P(LS, S), set

\[ y = P(LS, S)\,x. \]

The first L elements of y are obtained by collecting elements of x starting at element $x_0$ and then striding through x in steps of size S, i.e., $[x_0, x_S, \ldots, x_{(L-1)S}]$. The next L elements of y are obtained in the same way starting at $x_1$: $[x_1, x_{S+1}, \ldots, x_{(L-1)S+1}]$, and so on. We can represent the stride permutation P(LS, S) by a permutation matrix, which we will also denote by P(LS, S).


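Example A.1  The permutation matrix P(6,3) operating on the vector $x = [x_0\ x_1\ x_2\ x_3\ x_4\ x_5]^T$ is given by

\[
P(6,3)\,x =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\qquad (1)
\]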
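
The definition of P(LS, S) can be sketched in a few lines of Python. This is a minimal illustration rather than code from the report; the function name `stride_permutation` is an assumption, and the vector entries are represented by strings so that the reordering is visible.

```python
def stride_permutation(x, stride):
    """Return P(len(x), stride) applied to the sequence x.

    The output collects x[0], x[stride], x[2*stride], ..., then restarts
    at x[1], x[2], ... up to x[stride - 1].
    """
    n = len(x)
    if n % stride != 0:
        raise ValueError("stride must divide the vector length")
    return [x[j] for start in range(stride) for j in range(start, n, stride)]

x = ["x0", "x1", "x2", "x3", "x4", "x5"]
y = stride_permutation(x, 3)
print(y)  # ['x0', 'x3', 'x1', 'x4', 'x2', 'x5']
```

The printed order matches the rows of the permutation matrix in Example A.1.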


A.2.2  Tensor Product

Tensor product is a binary operator between two matrices of any size. Given two matrices A and B of sizes $M_A \times N_A$ and $M_B \times N_B$, respectively, a new matrix C of size $M_A M_B \times N_A N_B$ can be generated by the tensor product of A and B as

\[
C = A \otimes B =
\begin{bmatrix}
a_{(0,0)}B & a_{(0,1)}B & a_{(0,2)}B & \cdots & a_{(0,N_A-1)}B \\
a_{(1,0)}B & a_{(1,1)}B & a_{(1,2)}B & \cdots & a_{(1,N_A-1)}B \\
a_{(2,0)}B & a_{(2,1)}B & a_{(2,2)}B & \cdots & a_{(2,N_A-1)}B \\
\vdots & \vdots & \vdots & & \vdots \\
a_{(M_A-1,0)}B & a_{(M_A-1,1)}B & a_{(M_A-1,2)}B & \cdots & a_{(M_A-1,N_A-1)}B
\end{bmatrix},
\qquad (2)
\]

where $a_{(i,j)}$ is the element in the ith row and jth column of A, and $a_{(i,j)}B$ is scalar-matrix multiplication.


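Example A.2  Consider the following two matrices:

\[
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
\quad\text{and}\quad
B = \begin{bmatrix} 10 & 11 & 12 \\ 13 & 14 & 15 \end{bmatrix}.
\]

Then

\[
C = A \otimes B =
\begin{bmatrix} B & 2B \\ 3B & 4B \end{bmatrix} =
\begin{bmatrix}
10 & 11 & 12 & 20 & 22 & 24 \\
13 & 14 & 15 & 26 & 28 & 30 \\
30 & 33 & 36 & 40 & 44 & 48 \\
39 & 42 & 45 & 52 & 56 & 60
\end{bmatrix}
\]

according to equation (2).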
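
The block structure of equation (2) translates directly into a nested list comprehension. The sketch below is an illustration in plain Python with matrices stored as lists of rows; the function name `kron` is an assumption chosen for brevity.

```python
def kron(a, b):
    """Tensor (Kronecker) product of two matrices given as lists of rows.

    Row i*M_B + k of the result is built from row i of a and row k of b;
    column j*N_B + l holds a[i][j] * b[k][l].
    """
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]

A = [[1, 2],
     [3, 4]]
B = [[10, 11, 12],
     [13, 14, 15]]

C = kron(A, B)
for row in C:
    print(row)
```

Running this prints the four rows of the matrix C given in Example A.2.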


Two types of tensor products are of special interest to us here. One has an identity matrix on the left-hand side of the tensor product, called a prior identity matrix, and the other has an identity matrix on the right-hand side, referred to as a post identity matrix.

Denote the N x N identity matrix by $I_N$. For an M x M matrix A, $I_N \otimes A$ denotes the MN x MN block-diagonal matrix

\[
I_N \otimes A =
\begin{bmatrix}
A & & & \\
& A & & \\
& & \ddots & \\
& & & A
\end{bmatrix}.
\]

Example A.3  Consider a 4-processor machine and the butterfly matrix A,

\[
A = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]

Then

\[
y = (I_4 \otimes A)\,x =
\left[\begin{array}{c}
x_0 + x_1 \\ x_0 - x_1 \\ \hline
x_2 + x_3 \\ x_2 - x_3 \\ \hline
x_4 + x_5 \\ x_4 - x_5 \\ \hline
x_6 + x_7 \\ x_6 - x_7
\end{array}\right]
=
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}.
\]

Each processor executes one butterfly on a different part of x, where the node boundaries are represented by horizontal lines. If only 2 processors are available, then we can use the identity

\[ I_4 \otimes A = I_2 \otimes (I_2 \otimes A) \]

to implement the computation, where two butterflies are performed in each processor.

Tensor products with prior identity matrices can be used to represent parallel tasks. In general, on a k-processor distributed memory machine, execution of $(I_N \otimes A)$ would imply k parallel tasks, where N = nk and n is a positive integer.
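
The way a prior identity matrix splits work across nodes can be sketched as follows. This is a sequential model only: the "nodes" are contiguous chunks of a Python list, and the names `butterfly`, `local_task` and `i_kron_a` are hypothetical. Each chunk is processed independently, which is what makes the tasks parallel in the sense of Example A.3.

```python
def butterfly(pair):
    """Apply the 2 x 2 butterfly matrix [[1, 1], [1, -1]] to a pair."""
    return [pair[0] + pair[1], pair[0] - pair[1]]

def local_task(chunk):
    """Work of one node: a butterfly on each consecutive pair of its chunk."""
    out = []
    for i in range(0, len(chunk), 2):
        out.extend(butterfly(chunk[i:i + 2]))
    return out

def i_kron_a(x, nodes):
    """Compute (I_N (x) A) x with x split into `nodes` contiguous chunks.

    Assumes `nodes` divides the number of butterflies, len(x) // 2.
    """
    size = len(x) // nodes
    chunks = [x[r * size:(r + 1) * size] for r in range(nodes)]
    return [value for chunk in chunks for value in local_task(chunk)]

x = list(range(8))
y4 = i_kron_a(x, 4)   # one butterfly per node, as in I_4 (x) A
y2 = i_kron_a(x, 2)   # two butterflies per node, as in I_2 (x) (I_2 (x) A)
print(y4)  # [1, -1, 5, -1, 9, -1, 13, -1]
```

Both groupings give the same vector; only the amount of work assigned to each node changes.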


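If an identity matrix appears on the right-hand side of a tensor product (post identity matrix), the product is performed in a natural way as a vector operation. For an L x K matrix A,

\[
(A \otimes I_N)\,x =
\begin{bmatrix}
a_{(0,0)}I_N & a_{(0,1)}I_N & a_{(0,2)}I_N & \cdots & a_{(0,K-1)}I_N \\
a_{(1,0)}I_N & a_{(1,1)}I_N & a_{(1,2)}I_N & \cdots & a_{(1,K-1)}I_N \\
a_{(2,0)}I_N & a_{(2,1)}I_N & a_{(2,2)}I_N & \cdots & a_{(2,K-1)}I_N \\
\vdots & \vdots & \vdots & & \vdots \\
a_{(L-1,0)}I_N & a_{(L-1,1)}I_N & a_{(L-1,2)}I_N & \cdots & a_{(L-1,K-1)}I_N
\end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ \vdots \\ X_{K-1} \end{bmatrix},
\]

where $X_0, \ldots, X_{K-1}$ are the consecutive length-N subvectors of x.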

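Example A.4  Consider a vector computer with vector register length equal to 3, and an operational matrix and input vector defined as

\[
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}
\quad\text{and}\quad
x = [x_0\ x_1\ x_2\ x_3\ x_4\ x_5]^T.
\]

Then

\[
y = (A \otimes I_3)\,x =
\begin{bmatrix}
a x_0 + b x_3 \\ a x_1 + b x_4 \\ a x_2 + b x_5 \\
c x_0 + d x_3 \\ c x_1 + d x_4 \\ c x_2 + d x_5
\end{bmatrix}
=
\begin{bmatrix}
a & 0 & 0 & b & 0 & 0 \\
0 & a & 0 & 0 & b & 0 \\
0 & 0 & a & 0 & 0 & b \\
c & 0 & 0 & d & 0 & 0 \\
0 & c & 0 & 0 & d & 0 \\
0 & 0 & c & 0 & 0 & d
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\]

is performed by partitioning the input data into two subvectors $X_1 = [x_0\ x_1\ x_2]^T$ and $X_2 = [x_3\ x_4\ x_5]^T$, and with the vector operations $y_1 = aX_1 + bX_2$ and $y_2 = cX_1 + dX_2$.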

A. 2. 3  Some  Useful  Theorems 

Tensor  product  identities  provide  powerful  tool  developing  variants  of  an  algorithm.  We  will 
present  these  properties  without  proofs,  for  there  are  many  texts  containing  the  proofs  on  diverse 
levels  including  [13,  14].  We  use  the  convention  that  a  complex  tensor  product  formulation 
should  be  read  from  right  to  left. 
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Theorem  A.l  Multiplication  of  Tensor  Products:  If  Nx  =  Ma  and  Ny  =  Mb,  then 
the  following  multiplication  theorem  holds  true. 

(XwxXiVx  ®  YA/yxiVy)  ®  B^fiXiVs) 

=  {^MxxNx  ^Ma^Na)  ®  (YwyxA^y  BmbxNb)  (4) 

This  theorem  is  quite  often  used  to  derive  parallel  or  vector  computations  when  identity  matrices 
appear  in  the  product. 

Theorem  A. 2  Commutative  Law: 

(-A-a/^xVa  ®  Byv/gx  Nb)  =  P{MaMb,  Ma)  (BmbxNb  ®  ^MaxNa)  P{^a^b,  Nb)  (5) 

This  theorem  is  quite  useful  in  generating  different  communication  structures  of  an  algorithm. 

Theorem  A. 3  Inverse  of  Tensor  Products:  Unlike  the  case  in  multiplication  of  two  ma¬ 
trices,  inverse  of  tensor  product  of  two  matrices  does  not  change  the  order  of  its  parameters. 

(A®B)-' =  (A-i(g)B-')  (6) 

Theorem  A. 4  Multiplication  Theorem  of  Stride  Permutations:  Any  simple-stride  per¬ 
mutation  can  be  decomposed  into  two  stride  permutations  when  stride  is  a  multiple  of  two 
integers. 

P{NiN2N^,N,N2)  =  P{N,N2N2,Nt)P{NiN2Ns,N2)  (7) 

Theorem  A. 5  Inverse  Stride  Permutation: 

P(iViY2,iVi)-'  =  P{N^N2,N2).  (8) 

Theorem A.6 Parallel-Vector Tensor Factorization of Stride Permutations:

$$P(N_1 N_2 N_3, N_3) = [P(N_1 N_3, N_3) \otimes I_{N_2}]\,[I_{N_1} \otimes P(N_2 N_3, N_3)] \qquad (9)$$

This is one of the most important theorems for uncovering the extent of communication complexity
hidden in a permutation. When parameter $N_1$ is an integral multiple of the number of processing
elements, this theorem extracts parallel local operations from operations that depend upon
non-local data. A stride permutation can also be factored in a different way (inverse of
theorem (A.6)), leading to the following theorem.

Theorem A.7 Vector-Parallel Tensor Factorization of Stride Permutations:

$$P(N_1 N_2 N_3, N_1 N_2) = [I_{N_1} \otimes P(N_2 N_3, N_2)]\,[P(N_1 N_3, N_1) \otimes I_{N_2}] \qquad (10)$$
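The stride-permutation theorems can be checked numerically on small sizes. The sketch below is illustrative only: it assumes that $P(n, s)$ gathers its input with stride $s$ (the reading under which $P(MN, M)$ transposes a column-major matrix, as in equation (23)), and the helper names `stride_perm`, `identity`, `kron` and `compose` are hypothetical, not taken from the report.

```python
# Permutations are stored as source-index lists p, meaning y[t] = x[p[t]].
# All helper names are illustrative assumptions, not part of the report.

def stride_perm(n, s):
    """P(n, s): gather with stride s, i.e. y[i*(n//s) + j] = x[i + j*s]."""
    return [i + j * s for i in range(s) for j in range(n // s)]

def identity(n):
    return list(range(n))

def kron(a, b):
    """Tensor (Kronecker) product of two permutations."""
    return [pa * len(b) + pb for pa in a for pb in b]

def compose(*maps):
    """Matrix product maps[0] * maps[1] * ...; the rightmost factor acts first."""
    result = maps[-1]
    for m in reversed(maps[:-1]):
        result = [result[t] for t in m]
    return result

n1, n2, n3 = 2, 3, 4
n = n1 * n2 * n3

# Theorem A.4: P(n, N1*N2) = P(n, N1) P(n, N2)
a4 = stride_perm(n, n1 * n2) == compose(stride_perm(n, n1), stride_perm(n, n2))
# Theorem A.5: P(N1*N2, N1)^-1 = P(N1*N2, N2)
a5 = compose(stride_perm(n1 * n2, n1), stride_perm(n1 * n2, n2)) == identity(n1 * n2)
# Theorem A.6: P(n, N3) = [P(N1*N3, N3) (x) I_N2] [I_N1 (x) P(N2*N3, N3)]
a6 = stride_perm(n, n3) == compose(
    kron(stride_perm(n1 * n3, n3), identity(n2)),
    kron(identity(n1), stride_perm(n2 * n3, n3)))
# Theorem A.7: P(n, N1*N2) = [I_N1 (x) P(N2*N3, N2)] [P(N1*N3, N1) (x) I_N2]
a7 = stride_perm(n, n1 * n2) == compose(
    kron(identity(n1), stride_perm(n2 * n3, n2)),
    kron(stride_perm(n1 * n3, n1), identity(n2)))
print(a4, a5, a6, a7)
```

Storing each permutation as an index list rather than a dense matrix keeps the check cheap and makes the tensor product of two permutations a one-line index computation.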
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A.3  Data  Partition  and  Migration:  Formal  Definitions 

A.3.1 Storing Data in Distributed Memories

Most  large  scale  applications  of  scientific  computing  involve  manipulations  of  data  that  are 
expressed  in  terms  of  matrices  and  vectors.  This  is  natural  because  matrix  notation  gives  a 
compact  way  to  express  computation.  Moreover,  storing  matrices  or  vectors  in  the  memory  of 
a  computer  system  is  the  first  step  of  any  computation.  Different  ways  of  storing  data  may 
result  in  different  algorithmic  structures  as  well  as  different  computational  performance.  While 
methodology  and  algebraic  formulations  for  storing  matrices  in  a  linear  memory  space  of  a 
single  processor  system  exist,  such  as  row-major  and  column-major,  there  is  neither  a  formal 
and  commonly  agreed  way  of  addressing  data  stored  in  distributed  memory  multiprocessor 
systems, nor an agreed formal description for various storage schemes. Programmers for parallel
machines usually organize data according to their own convenience and the efficiency of a specific
algorithm. As a result, data allocation and partitioning in parallel processing are very diverse.
Therefore, there is a need for a unified approach for formalizing data allocation and partitioning
in parallel machines, and for a clear and convenient mathematical representation of various
data-storage schemes. In parallel computers, particularly in distributed memory multiprocessors,
communication costs are directly related to the data storage scheme. A clear representation
of storage schemes greatly helps the parallel programmer to examine the structure of implementations
and the communication costs associated with algorithms.

Consider a message-passing multiprocessor system with $k$ processors labeled from 0 to $k - 1$,
where $k = k_1 k_2$. We would like to partition and store a two-dimensional (2D) matrix, denoted
by $A$, onto this system. For simplicity and clarity of presentation, we present
only the cases where the data can be evenly divided into $k$ subsets and concentrate on our main
interest, the algebraic representation of partitioning the matrix and storing the parts in the processors'
memories. In the following, we assume that the operator $\mathrm{Vect}_{MN}(A)$ maps an $M \times N$
two-dimensional array, $A$, into a single-dimensional array of length $MN$, $a$, where the $(i, j)$th element
of $A$ is mapped to the $((j - 1)M + i)$th element of $a$ (column-major).

Definition A.1 Row-Division: Let $A$ be an $M \times N$ matrix. We define row-division onto
$k$ processors as follows. Partition $A$ into $k$ sets of complete rows such that the $i$-th set of rows
(top-down) is allocated to the $i$-th processor. In matrix notation, row-division can be represented
as operating by

$$\mathbf{P}_R(M, N, k) = P(Nk, k) \otimes I_{M/k} \qquad (11)$$

on a vector $a$ that is formed as $\mathrm{Vect}_{MN}(A)$.

We use bold-faced "P" ($\mathbf{P}$) with an appropriate subscript to represent our data-partition
definitions, while italic "P" ($P$) represents the operation of stride permutation explained in
section (A.2.1).

Definition A.2 Column-Division: Let $A$ be an $M \times N$ matrix. We define column-division
onto $k$ processors as follows. Partition matrix $A$ into $k$ sets of complete columns such that the
$i$-th set of columns (left-right) is allocated to the $i$-th processor. In matrix notation, column-division
is represented as operating by

$$\mathbf{P}_C(M, N, k) = I_{MN} \qquad (12)$$

on a vector $a$ that is formed as $\mathrm{Vect}_{MN}(A)$.

Definition A.3 Mesh-Division: Let $A$ be an $M \times N$ matrix. We define mesh-division of
$A$ onto a system of $k_1 \times k_2$ processors as follows. Partition the $M$ rows of $A$ into $k_1$ equal sets
of rows (top-down) and then partition each set of rows into $k_2$ equal subsets (left-right). Each
subset is an $M/k_1 \times N/k_2$ matrix but will have neither complete rows nor complete columns.
Allocation of these subsets to the $k$ processors is performed anti-lexicographically (top-down and
then left-right). In matrix notation, mesh-division is defined as

$$\mathbf{P}_M(M, N, k_1, k_2) = I_{k_2} \otimes P(N k_1/k_2, k_1) \otimes I_{M/k_1}. \qquad (13)$$

Definition A.4 Cyclic-Division: Let $A$ be an $M \times N$ matrix. We define cyclic-division of
$A$ onto $k$ processors as follows. Partition the vector $\mathrm{Vect}_{MN}(A)$ into $(MN/k)$ consecutive
subvectors such that the $i$-th element of each subvector is allocated to the $i$-th processor. In matrix
notation, cyclic-division can be represented as operating by

$$\mathbf{P}_{Cyc}(M, N, k) = P(MN, k) \qquad (14)$$

on a vector $a$ that is formed as $\mathrm{Vect}_{MN}(A)$.

Definition A.5 Block-Cyclic-Division: Let $A$ be an $M \times N$ matrix. We define block-cyclic-division
of $A$ onto a system with $k$ processors as follows. Partition the vector $\mathrm{Vect}_{MN}(A)$ into
$(MN/S)$ consecutive subvectors of length $S$ and assign the $i$-th subvector to the
$[i \pmod{k}]$-th processor. In matrix notation, block-cyclic-division can be represented as operating by

$$\mathbf{P}_{BC}(M, N, k) = P(MN/S, k) \otimes I_S \qquad (15)$$

on a vector $a$ that is formed as $\mathrm{Vect}_{MN}(A)$.




Block-cyclic is similar to cyclic except that each time $S$ elements are allocated to a processor
instead of one element. Also note that row-division can be obtained from block-cyclic-division
for the case of $M/k = S$, that is, when the number of rows assigned to each processor in matrix $A$ is
equal to the length of the subvector in block-cyclic-division: $P(MN/S, k) \otimes I_S$ then reduces to
$P(Nk, k) \otimes I_{M/k}$.

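The following five equations represent the inverse operations of the above five definitions, which can be
derived using theorems (A.3) and (A.5).

$$\mathbf{P}_R^{-1}(M, N, k) = P(Nk, N) \otimes I_{M/k} \qquad (16)$$

$$\mathbf{P}_C^{-1}(M, N, k) = I_{MN} \qquad (17)$$

$$\mathbf{P}_M^{-1}(M, N, k_1, k_2) = I_{k_2} \otimes P(N k_1/k_2, N/k_2) \otimes I_{M/k_1} \qquad (18)$$

$$\mathbf{P}_{Cyc}^{-1}(M, N, k) = P(MN, MN/k) \qquad (19)$$

$$\mathbf{P}_{BC}^{-1}(M, N, k) = P(MN/S, MN/(Sk)) \otimes I_S \qquad (20)$$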
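As a small illustration of the cyclic and block-cyclic definitions and of their inverses (equations (19) and (20)), the following sketch distributes a 12-element vector over 3 processors. It assumes the gather-with-stride reading of $P(n, s)$; the helper names and the sizes are illustrative choices, not values from the report.

```python
# Permutations are source-index lists p, meaning y[t] = x[p[t]] (assumed convention).

def stride_perm(n, s):
    """P(n, s): y[i*(n//s) + j] = x[i + j*s]."""
    return [i + j * s for i in range(s) for j in range(n // s)]

def identity(n):
    return list(range(n))

def kron(a, b):
    return [pa * len(b) + pb for pa in a for pb in b]

def compose(*maps):
    """Product of permutations; the rightmost factor acts first."""
    result = maps[-1]
    for m in reversed(maps[:-1]):
        result = [result[t] for t in m]
    return result

def apply(p, x):
    return [x[s] for s in p]

MN, k, S = 12, 3, 2            # illustrative sizes
a = list(range(MN))            # element labels 0..11
per_proc = MN // k

cyclic = apply(stride_perm(MN, k), a)                               # eq. (14)
block_cyclic = apply(kron(stride_perm(MN // S, k), identity(S)), a)  # eq. (15)
cyc_proc0 = cyclic[:per_proc]          # elements held by processor 0
bc_proc0 = block_cyclic[:per_proc]

# Equations (19) and (20): composing with the stated inverse gives the identity.
cyc_inverse_ok = compose(stride_perm(MN, MN // k), stride_perm(MN, k)) == identity(MN)
bc_inverse_ok = compose(
    kron(stride_perm(MN // S, MN // (S * k)), identity(S)),
    kron(stride_perm(MN // S, k), identity(S))) == identity(MN)
print(cyc_proc0, bc_proc0, cyc_inverse_ok, bc_inverse_ok)
```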


Example A.5 This example demonstrates data-partitioning of an $8 \times 8$ matrix, $A$, onto a
4-processor machine. Figure 1 shows how a 64-element vector $a$ formed by $\mathrm{Vect}_{64}(A)$ is
partitioned in row-division, column-division, and mesh-division based on definitions (A.1)-(A.3).
In case of row-division, $I_2$ on the right-hand side of $\mathbf{P}_R(8,8,4)$ represents moving vectors of
length 2 according to the permutation matrix $P(32,4)$. When this permutation is applied, the
resulting data at processor-0 is shown with a dotted line. For column-division data-partitioning,
since the input permutation is an identity matrix, no action needs to be performed, and the vector
$a$ is just segmented into four parts for allocation to the four processors. For mesh-division
data-partitioning, $I_2$ on the left-hand side of $\mathbf{P}_M(8,8,2,2)$ represents an action to divide the vector $a$
into two equal sets and perform the vector-stride action $P(8,2) \otimes I_4$ on each set. This
vector-stride action further divides each set into eight small subvectors of length 4 and shuffles them
according to the permutation $P(8,2)$. Once again, data residing at processor-0 after the action
of the input permutation is shown with a dotted line.
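The partitions of this example can be reproduced with a few lines of code. The sketch below labels each entry of the $8 \times 8$ matrix by its (row, column) pair, applies the row-, column- and mesh-division operators of definitions (A.1)-(A.3), and lists what processor 0 receives. The stride convention and all helper names are assumptions made for illustration.

```python
# Permutations are source-index lists p, meaning y[t] = x[p[t]] (assumed convention).

def stride_perm(n, s):
    """P(n, s): y[i*(n//s) + j] = x[i + j*s]."""
    return [i + j * s for i in range(s) for j in range(n // s)]

def identity(n):
    return list(range(n))

def kron(a, b):
    return [pa * len(b) + pb for pa in a for pb in b]

def apply(p, x):
    return [x[s] for s in p]

M = N = 8
k = 4
# Column-major Vect(A); every entry is tagged with its (row, column) position.
a = [(i, j) for j in range(N) for i in range(M)]

P_R = kron(stride_perm(N * k, k), identity(M // k))               # P(32,4) (x) I_2
P_C = identity(M * N)                                             # I_64
P_M = kron(identity(2), kron(stride_perm(8, 2), identity(4)))     # I_2 (x) P(8,2) (x) I_4

chunk = M * N // k   # 16 elements per processor
row_proc0 = apply(P_R, a)[:chunk]
col_proc0 = apply(P_C, a)[:chunk]
mesh_proc0 = apply(P_M, a)[:chunk]

def rows(entries):
    return sorted({i for i, _ in entries})

def cols(entries):
    return sorted({j for _, j in entries})

print("row-division, processor 0:  rows", rows(row_proc0), "cols", cols(row_proc0))
print("col-division, processor 0:  rows", rows(col_proc0), "cols", cols(col_proc0))
print("mesh-division, processor 0: rows", rows(mesh_proc0), "cols", cols(mesh_proc0))
```

Processor 0 ends up with the first two complete rows, the first two complete columns, and the upper-left $4 \times 4$ block, respectively.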


General Usage of Data-Partition Definitions

Consider any computational procedure that is expressed by an operational matrix $G$ operating
on a vector $a$ to obtain vector $b$:

$$b = G\,a. \qquad (21)$$

This equation ignores the underlying data-partition necessary to carry out the computation in a
distributed memory multiprocessor system. To bring out the data-partition, let $\tilde{a}$ $(= Q_1 a)$ be a
desired data-partition of $a$ among the processors, where $Q_1$ is one of the data-partition schemes
($\mathbf{P}_R$, $\mathbf{P}_C$, or $\mathbf{P}_M$) defined above. If one expects the output data to be in a particular partition
after the computation, then the resultant data is of the form $\tilde{b}$ where $\tilde{b} = Q_2 b$ and $Q_2$ is also one
of the definitions $\mathbf{P}_R$, $\mathbf{P}_C$, or $\mathbf{P}_M$ defined above. Then the parallel implementation corresponding
to equation (21), after incorporating our definitions, can be rewritten as:

$$\tilde{b} = Q_2\, b = Q_2\, G\, a = [Q_2\, G\, Q_1^{-1}]\, \tilde{a}. \qquad (22)$$

Therefore, $\tilde{G} = Q_2\, G\, Q_1^{-1}$ is the actual-operational matrix that takes into account the
complexity associated with the considered data-partition.

Figure 1: Action of data-partition algebraic expressions onto a 4-processor machine


A.3.2 Moving Data among Distributed Memories

Once input data is partitioned among the processors, data migrations at the interfaces between
individual algorithms may be necessary in order to achieve globally optimal performance of an
application. One frequently used data migration in numerical applications is the well-known matrix
transpose. Let $a = \mathrm{Vect}_{MN}(A_{M \times N})$ and $b = \mathrm{Vect}_{NM}(B_{N \times M})$, where $B_{N \times M}$ is the transpose
of $A_{M \times N}$. Then,

$$b = P(MN, M)\, a. \qquad (23)$$

Hence $P(MN, M)$ is the operational matrix for transpose algorithms, that is, $G = P(MN, M)$.
When data-partition schemes are to be incorporated, the actual-operational matrix becomes $\tilde{G}$
(see equation (22)). That is,

$$P(MN, M) = Q_2^{-1}\, \tilde{G}\, Q_1, \qquad (24)$$

and the equation (23) becomes

$$\tilde{b} = \tilde{G}\, \tilde{a}, \qquad (25)$$

where $\tilde{G} = Q_2\, P(MN, M)\, Q_1^{-1}$. In the following, we present derivations for the operational
matrices, $\tilde{G}$, required to transpose a matrix for the cases of row-, column-, and mesh-division
data-partitions defined in the previous subsection (assume $Q_1 = Q_2$ for simplicity), and visualize
their implementation aspects from their tensor product formulations.


Row-Division 

For row-division data-partition, we have

$$\tilde{G} = \mathbf{P}_R(N, M, k)\; P(MN, M)\; \mathbf{P}_R^{-1}(M, N, k). \qquad (26)$$

According to definition (A.1), we have

$$\tilde{G} = [P(Mk, k) \otimes I_{N/k}]\; P(MN, M)\; [P(Nk, N) \otimes I_{M/k}], \qquad (27)$$

(or)

$$G = P(MN, M) = [P(Mk, M) \otimes I_{N/k}]\; \tilde{G}\; [P(Nk, k) \otimes I_{M/k}]. \qquad (28)$$

Then, we can obtain an expression for $\tilde{G}$ by rewriting $G = P(MN, M)$ as:

$$P(MN, M) = [P(Mk, M) \otimes I_{N/k}]\,[I_k \otimes P(MN/k, M)]$$
by theorem (A.6)

$$P(MN, M) = \mathbf{P}_R^{-1}(N, M, k)\,[I_{k^2} \otimes P(MN/k^2, M/k)]\,[(I_k \otimes P(N, k)) \otimes I_{M/k}]$$
by theorem (A.7) and equation (16)

$$P(MN, M) = \mathbf{P}_R^{-1}(N, M, k)\,[I_k \otimes I_k \otimes P(MN/k^2, M/k)]\,[P(k^2, k) \otimes I_{N/k} \otimes I_{M/k}]\,[P(Nk, k) \otimes I_{M/k}]$$
by applying theorem (A.6) to $P(Nk, k)$

$$P(MN, M) = \mathbf{P}_R^{-1}(N, M, k)\,[I_k \otimes I_k \otimes P(MN/k^2, M/k)]\,[P(k^2, k) \otimes I_{MN/k^2}]\,\mathbf{P}_R(M, N, k) \qquad (29)$$
by definition (A.1)

$$P(MN, M) = \mathbf{P}_R^{-1}\, \tilde{G}\, \mathbf{P}_R.$$

Therefore, the actual-operational matrix in equation (25) for row-division partition can be
expressed as two stages:

$$\tilde{G} = [I_k \otimes I_k \otimes P(MN/k^2, M/k)]\,[P(k^2, k) \otimes I_{MN/k^2}]. \qquad (30)$$
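The two-stage form of equation (30) can be compared against the defining product of equation (26) on a small case. The sketch below is illustrative only: it assumes the gather-with-stride reading of $P(n, s)$, and the helper names and the sizes $M = 8$, $N = 4$, $k = 2$ are hypothetical choices.

```python
# Permutations are source-index lists p, meaning y[t] = x[p[t]] (assumed convention).

def stride_perm(n, s):
    """P(n, s): y[i*(n//s) + j] = x[i + j*s]."""
    return [i + j * s for i in range(s) for j in range(n // s)]

def identity(n):
    return list(range(n))

def kron(a, b):
    return [pa * len(b) + pb for pa in a for pb in b]

def compose(*maps):
    """Product of permutations; the rightmost factor acts first."""
    result = maps[-1]
    for m in reversed(maps[:-1]):
        result = [result[t] for t in m]
    return result

def inverse(p):
    q = [0] * len(p)
    for t, s in enumerate(p):
        q[s] = t
    return q

def row_division(m, n, k):
    """P_R(m, n, k) = P(n*k, k) (x) I_{m/k}, definition (A.1)."""
    return kron(stride_perm(n * k, k), identity(m // k))

M, N, k = 8, 4, 2
block = M * N // (k * k)        # message size MN/k^2

# Equation (23): P(MN, M) turns column-major A into column-major A^T.
a = [(i, j) for j in range(N) for i in range(M)]
b = [a[s] for s in stride_perm(M * N, M)]
is_transpose = b == [(i, j) for i in range(M) for j in range(N)]

# Equation (26): G~ = P_R(N, M, k) P(MN, M) P_R^{-1}(M, N, k)
g_direct = compose(row_division(N, M, k), stride_perm(M * N, M),
                   inverse(row_division(M, N, k)))

# Equation (30): a global exchange stage followed by a purely local stage.
stage_global = kron(stride_perm(k * k, k), identity(block))
stage_local = kron(identity(k * k), stride_perm(block, M // k))
g_staged = compose(stage_local, stage_global)

print(is_transpose, g_direct == g_staged)
```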

The first stage, $P(k^2, k) \otimes I_{MN/k^2}$, is a global-task involving message-passing among processors,
since the expression does not contain an identity matrix, $I_k$, on its left-hand side. The size of
each message being passed is $(MN/k^2)$, which is $(1/k)$th of the size of the data set residing at
a processor. This is reflected in the above tensor product expression by $I_{MN/k^2}$. The factor
$P(k^2, k)$ in the expression indicates that each processor has $(k - 1)$ subblocks to send out. Such
message passing is carried out in $(k - 1)$ stages with one subblock being kept within a processor.
When the number of processors, $k$, is a power of 2, a one-to-one communication structure
can be obtained with the xor binary operator, and a pseudo-code implementation for this stage is
shown in Table 1.

    me = my node number
    for index = 1 to k - 1
        myswap = xor(me, index)
        Send block-myswap of my associated vector a to processor-myswap
        Receive message from processor-myswap
        Store message at block-myswap of my associated vector a
    end

Table 1: Pseudo-code for message passing in transpose algorithms either for row-division or
column-division partitions
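The xor pairing of Table 1 can be simulated sequentially to see that it realizes the block exchange $P(k^2, k) \otimes I_{MN/k^2}$. In the sketch below, each of the $k$ simulated processors holds $k$ blocks and every block is represented by a single label; the function name `xor_exchange` and this single-process simulation are illustrative assumptions, not the message-passing code used in the report.

```python
def xor_exchange(blocks):
    """Simulate the k-1 pairwise exchange rounds of Table 1.

    blocks[p][q] is block q held by processor p; k must be a power of 2.
    """
    k = len(blocks)
    assert k > 0 and k & (k - 1) == 0, "k must be a power of 2"
    result = [row[:] for row in blocks]
    for index in range(1, k):
        for me in range(k):
            myswap = me ^ index
            # 'myswap' sends its block-'me'; 'me' stores it at block-'myswap'.
            result[me][myswap] = blocks[myswap][me]
    return result

k = 4
blocks = [[(p, q) for q in range(k)] for p in range(k)]
exchanged = xor_exchange(blocks)

# Reference: the stride permutation P(k^2, k) applied to the list of all blocks,
# read with the convention y[i*k + j] = x[i + j*k].
flat = [blocks[p][q] for p in range(k) for q in range(k)]
reference = [flat[i + j * k] for i in range(k) for j in range(k)]
matches_stride_perm = [blk for row in exchanged for blk in row] == reference
print(matches_stride_perm)
```

In each round every processor talks to exactly one partner, and block `me` of each processor never moves, which matches the statement that one subblock is kept locally.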

The second stage, $I_k \otimes I_k \otimes P(MN/k^2, M/k)$, represents a local-task due to the identity
matrix $I_k$ on its left-hand side. Each processor performs the parallel-stride operation
$[I_k \otimes P(MN/k^2, M/k)]$ locally.

Column-Division 

For column-division data-partition, we have

$$\tilde{G} = \mathbf{P}_C(N, M, k)\; P(MN, M)\; \mathbf{P}_C^{-1}(M, N, k). \qquad (31)$$

According to definition (A.2), we have

$$\tilde{G} = I_{MN}\; P(MN, M)\; I_{MN} = P(MN, M) = G. \qquad (32)$$

Then, we can obtain an expression for $\tilde{G}$ as:

$$P(MN, M) = [I_k \otimes P(MN/k, M/k)]\,[P(Nk, k) \otimes I_{M/k}]$$
by theorem (A.6)

$$P(MN, M) = [I_k \otimes P(MN/k, M/k)]\,[\{(P(k^2, k) \otimes I_{N/k})\,(I_k \otimes P(N, k))\} \otimes I_{M/k}]$$
by theorem (A.7)

$$P(MN, M) = [I_k \otimes P(MN/k, M/k)]\,[P(k^2, k) \otimes I_{MN/k^2}]\,[I_k \otimes P(N, k) \otimes I_{M/k}]. \qquad (33)$$

Therefore, the actual-operational matrix in equation (25) for column-division partitioning can
be expressed as three stages:

$$\tilde{G} = [I_k \otimes P(MN/k, M/k)]\,[P(k^2, k) \otimes I_{MN/k^2}]\,[I_k \otimes P(N, k) \otimes I_{M/k}]. \qquad (34)$$
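The three-stage column-division factorization can be checked against $P(MN, M)$ in the same way. This is a sketch under the gather-with-stride reading of $P(n, s)$; the helper names and the sizes $M = 4$, $N = 6$, $k = 2$ are illustrative assumptions.

```python
# Permutations are source-index lists p, meaning y[t] = x[p[t]] (assumed convention).

def stride_perm(n, s):
    """P(n, s): y[i*(n//s) + j] = x[i + j*s]."""
    return [i + j * s for i in range(s) for j in range(n // s)]

def identity(n):
    return list(range(n))

def kron(a, b):
    return [pa * len(b) + pb for pa in a for pb in b]

def compose(*maps):
    """Product of permutations; the rightmost factor acts first."""
    result = maps[-1]
    for m in reversed(maps[:-1]):
        result = [result[t] for t in m]
    return result

M, N, k = 4, 6, 2

# Stage 1 (local): I_k (x) P(N, k) (x) I_{M/k}
stage1 = kron(identity(k), kron(stride_perm(N, k), identity(M // k)))
# Stage 2 (global): P(k^2, k) (x) I_{MN/k^2}
stage2 = kron(stride_perm(k * k, k), identity(M * N // (k * k)))
# Stage 3 (local): I_k (x) P(MN/k, M/k)
stage3 = kron(identity(k), stride_perm(M * N // k, M // k))

three_stage_ok = compose(stage3, stage2, stage1) == stride_perm(M * N, M)
print(three_stage_ok)
```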



The first stage, $I_k \otimes P(N, k) \otimes I_{M/k}$, represents local data permutations without message-passing
due to the prior identity $I_k$. Each processor performs the vector-stride operation $[P(N, k) \otimes I_{M/k}]$,
which moves $N$ vectors with stride $k$; each vector is of length $(M/k)$.

The second stage, $P(k^2, k) \otimes I_{MN/k^2}$, is a global-task that is similar to the message-passing stage
explained in the row-division transpose algorithm. Hence the total communication is again $(k - 1)$
messages, each message being of length $(MN/k^2)$.

The final stage, $I_k \otimes P(MN/k, M/k)$, is a local simple-stride permutation stage with stride
$(M/k)$ within each processor. All processors carry out the same operation in parallel without
communication.

Mesh-Division

For mesh-division partition, we have

$$\tilde{G} = \mathbf{P}_M(N, M, k_2, k_1)\; P(MN, M)\; \mathbf{P}_M^{-1}(M, N, k_1, k_2). \qquad (35)$$

According to definition (A.3) we have

$$\tilde{G} = [I_{k_1} \otimes P(M k_2/k_1, k_2) \otimes I_{N/k_2}]\; P(MN, M)\; [I_{k_2} \otimes P(N k_1/k_2, N/k_2) \otimes I_{M/k_1}], \qquad (36)$$

(or)

$$P(MN, M) = [I_{k_1} \otimes P(M k_2/k_1, M/k_1) \otimes I_{N/k_2}]\; \tilde{G}\; [I_{k_2} \otimes P(N k_1/k_2, k_1) \otimes I_{M/k_1}]. \qquad (37)$$

Then we can obtain an expression for $\tilde{G}$ by decomposing $G = P(MN, M)$ as follows.

$$P(MN, M) = [I_{k_1} \otimes P(MN/k_1, M/k_1)]\,[P(N k_1, k_1) \otimes I_{M/k_1}]$$
by theorem (A.6)

$$P(MN, M) = [I_{k_1} \otimes P(M k_2/k_1, M/k_1) \otimes I_{N/k_2}]\,[I_k \otimes P(MN/k, M/k_1)]\,[P(k, k_1) \otimes I_{MN/k}]\,[I_{k_2} \otimes P(N k_1/k_2, k_1) \otimes I_{M/k_1}] \qquad (38)$$
by theorem (A.7) on $P(MN/k_1, M/k_1)$ and by theorem (A.6) on $P(N k_1, k_1)$

$$P(MN, M) = \mathbf{P}_M^{-1}(N, M, k_2, k_1)\,[I_k \otimes P(MN/k, M/k_1)]\,[P(k, k_1) \otimes I_{MN/k}]\,\mathbf{P}_M(M, N, k_1, k_2) \qquad (39)$$
by equation (18) and definition (A.3)

$$P(MN, M) = \mathbf{P}_M^{-1}(N, M, k_2, k_1)\,[P(k, k_1) \otimes I_{MN/k}]\,[I_k \otimes P(MN/k, M/k_1)]\,\mathbf{P}_M(M, N, k_1, k_2) \qquad (40)$$
by commutative law

$$P(MN, M) = \mathbf{P}_M^{-1}\, \tilde{G}\, \mathbf{P}_M.$$

Therefore, the actual-operational matrix in equation (25) for mesh-division partition can be
expressed as two stages in two different ways (equations (39) and (40)):

(a) $\tilde{G} = [I_k \otimes P(MN/k, M/k_1)]\,[P(k, k_1) \otimes I_{MN/k}]$, and

(b) $\tilde{G} = [P(k, k_1) \otimes I_{MN/k}]\,[I_k \otimes P(MN/k, M/k_1)]$.
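Both two-stage forms of the mesh-division transpose can be compared with the defining product $\mathbf{P}_M(N, M, k_2, k_1)\, P(MN, M)\, \mathbf{P}_M^{-1}(M, N, k_1, k_2)$ on a small non-square case. The sketch assumes the gather-with-stride reading of $P(n, s)$; the helper names and the sizes $M = 4$, $N = 6$, $k_1 = 2$, $k_2 = 3$ are illustrative assumptions.

```python
# Permutations are source-index lists p, meaning y[t] = x[p[t]] (assumed convention).

def stride_perm(n, s):
    """P(n, s): y[i*(n//s) + j] = x[i + j*s]."""
    return [i + j * s for i in range(s) for j in range(n // s)]

def identity(n):
    return list(range(n))

def kron(a, b):
    return [pa * len(b) + pb for pa in a for pb in b]

def compose(*maps):
    """Product of permutations; the rightmost factor acts first."""
    result = maps[-1]
    for m in reversed(maps[:-1]):
        result = [result[t] for t in m]
    return result

def inverse(p):
    q = [0] * len(p)
    for t, s in enumerate(p):
        q[s] = t
    return q

def mesh_division(m, n, k1, k2):
    """P_M(m, n, k1, k2) = I_{k2} (x) P(n*k1/k2, k1) (x) I_{m/k1}, definition (A.3)."""
    return kron(identity(k2), kron(stride_perm(n * k1 // k2, k1), identity(m // k1)))

M, N, k1, k2 = 4, 6, 2, 3
k = k1 * k2

g_direct = compose(mesh_division(N, M, k2, k1), stride_perm(M * N, M),
                   inverse(mesh_division(M, N, k1, k2)))

comm = kron(stride_perm(k, k1), identity(M * N // k))             # one message per processor
local = kron(identity(k), stride_perm(M * N // k, M // k1))       # local block transpose

form_a_ok = g_direct == compose(local, comm)   # (a): communicate, then transpose locally
form_b_ok = g_direct == compose(comm, local)   # (b): transpose locally, then communicate
print(form_a_ok, form_b_ok)
```

The two stages commute because one only reorders whole per-processor blocks while the other only reorders data inside each block.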

In case of (a), the first stage, $P(k, k_1) \otimes I_{MN/k}$, is a global-task involving message-passing
since there is no prior identity matrix. In fact, it is a single message-passing routine with
message size being $(MN/k)$, as compared to $(k - 1)$ messages each of size $(MN/k^2)$ in either
row-division or column-division transpose algorithms.

The second stage, $I_k \otimes P(MN/k, M/k_1)$, represents that each processor executes a local
simple-stride permutation because of the prior identity matrix $I_k$. In fact, if we consider the data
at each processor to be a matrix of size $M/k_1 \times N/k_2$, then the action to be performed in this stage
is $k$ local matrix transposes that are performed simultaneously on $k$ processors.

A.3.3  Measured  Timing  of  the  Three  Transpose  Algorithms 

Transpose algorithms derived in Section A.3.2 are implemented on Intel's Paragon. The measured
execution times of the three transpose algorithms are tabulated in Table 2. From the
derivations in equations (30), (33), (39), and (40), we have seen that transposing a matrix of
size $M \times N$ on a $k$-processor machine for row- and column-division each requires $(k - 1)$
communications with the size of each message being $(MN/k^2)$. For mesh-division, one communication
of size $(MN/k)$ is needed. Though the message length in mesh-division is $k$ times that
of any message in either row-division or column-division, results in Table 2 show that the transpose
algorithm for mesh-division reduces the overheads to initiate communications. A smaller number
of long messages can take advantage of the pipelined nature of wormhole routing [15]. These
results also show that, unlike uniprocessor algorithms, variations in data-decompositions can
have a great impact on the performance of an algorithm.

   M       N     Row-Division   Col-Division   Mesh-Division
                    (msec)         (msec)          (msec)
  128     128        5.236          6.172           1.316
  128     256        5.902          7.051           2.028
  128     512        9.031         10.409           2.159
  128    1024       12.356         15.312           3.866
  256     128        5.501          6.665           1.825
  256     256        8.283          9.746           2.301
  256     512       11.483         14.027           4.018
  256    1024       20.076         22.503           7.548
  512     128        8.310          9.432           3.450
  512     256       11.555         13.359           5.905
  512     512       18.536         21.122           7.954
  512    1024       39.628         38.529          16.434
 1024     128       11.228         13.132           5.815
 1024     256       17.526         20.616          10.631
 1024     512       31.211         37.445          20.889
 1024    1024       50.936         66.403          49.274

Table 2: Experimental results of transpose algorithms on 8-node Intel's Paragon.
Explanation: Transpose algorithms for row-division and column-division require seven small
communications, while that in mesh requires only one large communication. The effect of communication
overhead on the transpose algorithm clearly makes mesh-division more efficient than the other cases.
Among row-division and column-division structures, row-division requires only one local permutation
while column-division requires one local permutation before the communications and another after
that. This is also seen from the third and fourth columns.

Figure 2: Flow Chart for computation of coefficients of Vorticity

A.4  Application  Examples 

A.4.1 An Application in Fluid Mechanics

This application solves the Euler partial differential equation using the wavelet-Galerkin method [12,
16]. Figure 2 shows the flowchart for evaluating the coefficients of vorticity in fluid mechanics at
each time-step. The major computation blocks in the figure are Jacobian and Helmholtz, while
other modules such as Error Check and computation of vorticity coefficients in the next step ($\Delta t$) are
not  time  consuming.  It  is  well  known  that  Jacobian  prefers  mesh-division  data-partitioning 
because  of  boundary  conditions,  that  is,  data  dependency  exists  along  the  four  edges  of  a 
grid.  However,  Intel’s  distributed  memory  machines  have  efficient  two  dimensional  fast  Fourier 
transform  algorithms  (2D-FFT)  based  on  row-  or  column-division  data-partitions.  Therefore, 
switching  between  different  data-partition  schemes  is  necessary  to  carry  out  the  computation 
of  this  application  efficiently.  First,  we  need  to  convert  the  mesh-division  data-partitioning  to 
row-  or  column-division  at  the  output  of  Jacobian  (the  input  of  Helmholtz).  Then,  we  need  to 
convert  back  to  mesh-division  at  the  input  of  the  Jacobian  (output  of  Helmholtz). 

Converting  data-partitioning  schemes  at  the  interfaces  of  different  computational  modules  can 
be  very  expensive  since  it  involves  massive  amount  of  data  movements.  The  communication  cost 
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caused  by  such  data  movements  may  well  dominate  the  total  computation  cost  of  applications 
even though individual algorithms are optimized. With our formal definitions of data-partitions,
however, manipulation of communication cost becomes straightforward.


Let us consider the computation of the Helmholtz on a $k$-processor distributed memory
system. Assume that $k$ can be factored as $k = k_s \times k_s$. The input data to this computation
module is in mesh-division, which can be represented by $\mathbf{P}_M$. With this input format, we
perform the 2D-FFT and its inverse (2D-IFFT). The summation form of the 2D-DFT on a matrix
$X$ of size $M \times N$ is given by:

$$Y(k, l) = \sum_{m=0}^{M-1} \left[ \sum_{n=0}^{N-1} X(m, n)\, e^{-j 2\pi n l / N} \right] e^{-j 2\pi m k / M}. \qquad (41)$$

The tensor product representation of equation (41) can be written as

$$y = \underbrace{[F_N \otimes F_M]}_{G}\; x, \qquad (42)$$

where $F_J$ is a $J \times J$ matrix with entries $F(i, k) = \exp(-j 2\pi i k / J)$, $j = \sqrt{-1}$, $y = \mathrm{Vect}_{MN}(Y)$,
$x = \mathrm{Vect}_{MN}(X)$, and $G$ is the operational matrix.
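The equivalence between the double sum of the 2D-DFT and the tensor product form $y = [F_N \otimes F_M]\,x$ on the column-major vector can be illustrated with a tiny dense computation. This is only a sketch for checking the indexing: the function names and the $2 \times 3$ test matrix are hypothetical, and a practical implementation would use FFT routines rather than dense matrices.

```python
import cmath

def dft_matrix(J):
    """F_J with entries exp(-j*2*pi*i*k/J)."""
    return [[cmath.exp(-2j * cmath.pi * i * k / J) for k in range(J)] for i in range(J)]

def kron_matrix(A, B):
    """Dense Kronecker product A (x) B."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dft2_direct(X, k, l):
    """Double-sum form of the 2D-DFT for output index (k, l)."""
    M, N = len(X), len(X[0])
    return sum(X[m][n]
               * cmath.exp(-2j * cmath.pi * n * l / N)
               * cmath.exp(-2j * cmath.pi * m * k / M)
               for m in range(M) for n in range(N))

M, N = 2, 3
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [X[m][n] for n in range(N) for m in range(M)]     # column-major Vect(X)

y = matvec(kron_matrix(dft_matrix(N), dft_matrix(M)), x)

# y is the column-major Vect(Y): entry (k, l) sits at position l*M + k.
max_err = max(abs(y[l * M + k] - dft2_direct(X, k, l))
              for k in range(M) for l in range(N))
print(max_err < 1e-9)
```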


To compute equation (42) on a $k$-processor parallel machine, we first parallelize the operational
matrix by inserting identity matrices under the assumption that $k$ divides both $M$ and
$N$. There are two ways of decomposing equation (42): (a) $y = [I_N \otimes F_M]\,[F_N \otimes I_M]\; x$, which
first computes Fourier transforms on columns (using one-dimensional FFT routines) followed
by transforms on rows, and (b) $y = [F_N \otimes I_M]\,[I_N \otimes F_M]\; x$, which performs the transformation on
rows followed by that on columns. These two decompositions are well known as the row-column
decomposition for transform methods. Consider the first decomposition (a). The factor on the
left-hand side represents a parallel computation of $F_M$ because of the preceding identity matrix,
while the factor on the right-hand side cannot be done in parallel. To parallelize this stage
of computation, we apply the commutative law presented in theorem (A.2), resulting in

$$y = [I_N \otimes F_M]\; P(MN, N)\; [I_M \otimes F_N]\; P(MN, M)\; x. \qquad (43)$$
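Equation (43) can be read as a four-step procedure: transpose, transform contiguous blocks of length $N$, transpose back, transform contiguous blocks of length $M$. The sketch below follows these steps on a small matrix and compares the result with the double-sum definition. It assumes the gather-with-stride reading of $P(n, s)$, and the helper names are illustrative.

```python
import cmath

def stride_perm(n, s):
    """P(n, s) as a source-index list: y[i*(n//s) + j] = x[i + j*s]."""
    return [i + j * s for i in range(s) for j in range(n // s)]

def apply(p, x):
    return [x[s] for s in p]

def dft_matrix(J):
    return [[cmath.exp(-2j * cmath.pi * i * k / J) for k in range(J)] for i in range(J)]

def block_apply(F, x):
    """(I (x) F) x: apply F independently to each contiguous block of len(F) entries."""
    J = len(F)
    return [sum(F[r][c] * x[b + c] for c in range(J))
            for b in range(0, len(x), J) for r in range(J)]

def dft2_direct(X, k, l):
    M, N = len(X), len(X[0])
    return sum(X[m][n]
               * cmath.exp(-2j * cmath.pi * n * l / N)
               * cmath.exp(-2j * cmath.pi * m * k / M)
               for m in range(M) for n in range(N))

M, N = 2, 3
X = [[1.0, -2.0, 0.5],
     [3.0, 4.0, -1.0]]
x = [X[m][n] for n in range(N) for m in range(M)]   # column-major Vect(X)

t = apply(stride_perm(M * N, M), x)    # P(MN, M): rows of X become contiguous
t = block_apply(dft_matrix(N), t)      # I_M (x) F_N
t = apply(stride_perm(M * N, N), t)    # P(MN, N): back to column-major order
y = block_apply(dft_matrix(M), t)      # I_N (x) F_M

max_err = max(abs(y[l * M + k] - dft2_direct(X, k, l))
              for k in range(M) for l in range(N))
print(max_err < 1e-9)
```

Every transform step now has an identity on the left of the tensor product, so it splits into independent blocks; the cost has moved into the two stride permutations, which are the transposes discussed in Section A.3.2.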

If it is required that the Fourier transformed data be in the same data-partition scheme as
the original data (say mesh-division), then the input is $\tilde{x} = \mathbf{P}_M\, x$ and the output is
$\tilde{y} = \mathbf{P}_M\, y$ ($\mathbf{P}_M^{-1} = I_{k_s} \otimes P(N, N/k_s) \otimes I_{M/k_s}$). Equation (43) can be
rewritten as:

$$\tilde{y} = \mathbf{P}_M\, [I_N \otimes F_M]\; P(MN, N)\; [I_M \otimes F_N]\; [P(MN, M)\, \mathbf{P}_M^{-1}]\; \tilde{x}. \qquad (44)$$

If we use the second parallelization, (b), we have

$$\tilde{y} = [\mathbf{P}_M\, P(MN, N)]\; [I_M \otimes F_N]\; P(MN, M)\; [I_N \otimes F_M]\; \mathbf{P}_M^{-1}\; \tilde{x}. \qquad (45)$$

In the following, we will see how we utilize our new definitions on data-partition and migration
to maximize the parallelism and minimize the communication cost while computing
equation (45). From equation (43), we can see that two transpose algorithms are required, $P(MN, N)$
and $P(MN, M)$. Each of these transpose algorithms needs $(k - 1)$ stages of message-passing
on a $k$-processor machine, as evidenced in the last subsection. We will show in the following
how we reduce the communication cost of one of the two transpose algorithms from $(k - 1)$ to
$(2k_s - 2)$ by manipulating the algorithm expressions. Note that $k = k_s^2$.

Now, let us consider equation (45). The first stage of computation is $\mathbf{P}_M^{-1}$, which converts
the mesh-division into column-division for FFT computation ($\mathbf{P}_C\, \mathbf{P}_M^{-1} = \mathbf{P}_M^{-1}$ since $\mathbf{P}_C$ is an
identity matrix). According to our definition, we have

$$\mathbf{P}_M^{-1} = [I_{k_s} \otimes P(N, N/k_s) \otimes I_{M/k_s}].$$

Using equation (18) and theorem (A.6), we have

$$\mathbf{P}_M^{-1} = \underbrace{[I_k \otimes P(N/k_s, N/k) \otimes I_{M/k_s}]}_{Z_2}\; \underbrace{[I_{k_s} \otimes P(k, k_s) \otimes I_{MN/(k k_s)}]}_{Z_1}. \qquad (46)$$

The above factorization of $\mathbf{P}_M^{-1}$ results in two stages. The first stage, $Z_1$, involves $(k_s - 1)$
communications with $k_s$ columns of processors communicating in parallel. The second stage,
$Z_2$, is a local vector-stride data-shuffling.


Similarly, at the output of the FFT, we can also manipulate the algebraic expression for
mesh-division data-partitioning. The last stage of equation (45) converts back to mesh-division
data-partition, which can be simplified as follows.

$$\mathbf{P}_M\, P(MN, N) = [I_{k_s} \otimes P(N, k_s) \otimes I_{M/k_s}]\,[P(N k_s, N) \otimes I_{M/k_s}]\,[I_{k_s} \otimes P(MN/k_s, N)]$$
by theorem (A.7) and definition (A.3)

$$= \underbrace{[P(k, k_s) \otimes I_{MN/k}]}_{Z_{11}}\,[I_{k_s} \otimes P(MN/k_s, N)]$$

$$= Z_{11}\,[I_{k_s} \otimes P(N k_s, N) \otimes I_{M/k}]\,[I_k \otimes P(MN/k, N)]$$

$$= Z_{11}\, \underbrace{[I_k \otimes P(N, N/k_s) \otimes I_{M/k}]}_{Z_{10}}\; \underbrace{[I_{k_s} \otimes P(k, k_s) \otimes I_{MN/(k k_s)}]}_{Z_9}\,[I_k \otimes P(MN/k, N)]. \qquad (47)$$

From the transpose algorithm derived for mesh-division partition in Section A.3.2, we know
that $Z_{11}$ represents one single communication. Stage $Z_9$ is also a transpose algorithm similar
to stage $Z_1$ that requires $(k_s - 1)$ communications. Therefore, the total number of communications
required to carry out the FFT is $(k + 2k_s - 2)$, as compared to $(2k - 2)$ for direct
interfacing between the Helmholtz and Jacobian.


Another variant of the computation can also be obtained easily by manipulating the tensor
algebra in a different way. Consider the last stage, $[\mathbf{P}_M\, P(MN, N)]$. First, we factor $P(MN, N)$
as follows.

$$P(MN, N) = [I_{k_s} \otimes P(MN/k_s, N/k_s)]\,[P(M k_s, k_s) \otimes I_{N/k_s}]$$
by theorem (A.6)

$$= [I_{k_s} \otimes \{(P(N, N/k_s) \otimes I_{M/k_s})\,(I_{k_s} \otimes P(MN/k, N/k_s))\}]\,[P(M k_s, k_s) \otimes I_{N/k_s}]$$
by theorem (A.7) on $P(MN/k_s, N/k_s)$

$$= \mathbf{P}_M^{-1}\,[I_k \otimes P(MN/k, N/k_s)]\,[P(M k_s, k_s) \otimes I_{N/k_s}] \qquad (48)$$
by equation (18)

Therefore by theorem (A.7) we have,

$$\mathbf{P}_M\, P(MN, N) = [I_k \otimes P(MN/k, N/k_s)]\,[P(M k_s, k_s) \otimes I_{N/k_s}]$$

$$= \underbrace{[I_k \otimes P(MN/k, N/k_s)]}_{Z_{10}}\; \underbrace{[P(k k_s, k_s) \otimes I_{MN/(k k_s)}]}_{Z_9}\; \underbrace{[I_k \otimes P(M/k_s, k_s) \otimes I_{N/k_s}]}_{Z_8}. \qquad (49)$$

Once again we reduced the total communication cost from $(2k - 1)$ to $(k + 2k_s - 3)$, eliminating
the one large and final communication from the previous variant.
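The factorizations used in this subsection can be checked numerically for a square processor mesh. The sketch below tests one reading of equations (46), (47) and (49), with $k = k_s^2$, $\mathbf{P}_M = I_{k_s} \otimes P(N, k_s) \otimes I_{M/k_s}$, and identity blocks of size $MN/(k k_s)$ in the communication stages. The gather-with-stride convention, the helper names and the sizes $M = N = 8$, $k_s = 2$ are assumptions made for this check.

```python
# Permutations are source-index lists p, meaning y[t] = x[p[t]] (assumed convention).

def P(n, s):
    """Stride permutation P(n, s): y[i*(n//s) + j] = x[i + j*s]."""
    return [i + j * s for i in range(s) for j in range(n // s)]

def I(n):
    return list(range(n))

def kron(*factors):
    """Tensor product of several permutations, left to right."""
    result = factors[0]
    for b in factors[1:]:
        result = [pa * len(b) + pb for pa in result for pb in b]
    return result

def compose(*maps):
    """Product of permutations; the rightmost factor acts first."""
    result = maps[-1]
    for m in reversed(maps[:-1]):
        result = [result[t] for t in m]
    return result

M, N, ks = 8, 8, 2
k = ks * ks
MN = M * N

PM = kron(I(ks), P(N, ks), I(M // ks))
PM_inv = kron(I(ks), P(N, N // ks), I(M // ks))

# Equation (46): P_M^{-1} = Z2 * Z1
Z1 = kron(I(ks), P(k, ks), I(MN // (k * ks)))
Z2 = kron(I(k), P(N // ks, N // k), I(M // ks))
eq46_ok = PM_inv == compose(Z2, Z1)

last_stage = compose(PM, P(MN, N))       # P_M P(MN, N)

# Equation (47): Z11 * Z10 * Z9 * [I_k (x) P(MN/k, N)]
Z11 = kron(P(k, ks), I(MN // k))
Z10 = kron(I(k), P(N, N // ks), I(M // k))
Z9 = kron(I(ks), P(k, ks), I(MN // (k * ks)))
eq47_ok = last_stage == compose(Z11, Z10, Z9, kron(I(k), P(MN // k, N)))

# Equation (49): one local stage, one exchange stage, one local stage
eq49_ok = last_stage == compose(
    kron(I(k), P(MN // k, N // ks)),
    kron(P(k * ks, ks), I(MN // (k * ks))),
    kron(I(k), P(M // ks, ks), I(N // ks)))

print(eq46_ok, eq47_ok, eq49_ok)
```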


Experiments of running the complete application based on our derivation above have been
carried out on Intel's iPSC/860. The execution times of the important computational modules
as well as the total execution time were measured. The results reported in Table 3 are averaged
over a hundred runs. The columns marked row-D are the execution times of row-division, while
those marked Mesh1 and Mesh2 are for the two variants of mesh-division computation derived
above. From this table, performance improvement of up to 43.61% is observed.

          Jacobian             Helmholtz                        Total
 Nodes    row-D     Mesh       row-D      Mesh1      Mesh2      row-D     Mesh1     Mesh2
   4      2.8317    2.7939     0.11216    0.18218    0.16298    2.9438    2.9761    2.9568
  16      0.8128    0.7310     0.06094    0.09950    0.07688    0.8738    0.8305    0.8079
  64      0.3095    0.1996     0.10510    0.12022    0.08916    0.4146    0.3198    0.2887

Table 3: Timing results for 128x128 size vorticity computations. Explanation: Results demonstrate
that by restructuring at the interface of Jacobian and Helmholtz using our data-partition expressions,
we could improve the efficiency of Jacobian at the cost of a slight decrease in efficiency of Helmholtz.
This resulted in a total improvement of the efficiency of the application.
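The quoted improvement can be recomputed from the "Total" columns of Table 3. The sketch below assumes that the percentage is the row-division total measured relative to the Mesh2 total; that definition is an assumption, chosen because it reproduces the 43.61% figure at 64 nodes.

```python
# Total execution times from Table 3, keyed by node count: (row-D, Mesh1, Mesh2).
totals = {
    4: (2.9438, 2.9761, 2.9568),
    16: (0.8738, 0.8305, 0.8079),
    64: (0.4146, 0.3198, 0.2887),
}

def improvement_percent(old_time, new_time):
    """Relative gain of new_time over old_time, in percent of new_time (assumed definition)."""
    return (old_time - new_time) / new_time * 100.0

mesh2_gain = {nodes: round(improvement_percent(row_d, mesh2), 2)
              for nodes, (row_d, _mesh1, mesh2) in totals.items()}

for nodes in sorted(mesh2_gain):
    print(nodes, "nodes:", mesh2_gain[nodes], "%")
```

On 4 nodes the row-division total is slightly lower than both mesh totals, so the benefit of the restructuring appears only as the number of nodes grows.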


A.4.2  A  New  FFT  Algorithm 

Existing machine libraries on Intel's multiprocessors for FFT computation are based on
row-division or column-division. From our new definitions of data-partitions, we developed a new
communication structure for the parallel FFT algorithm [17]. The main idea is to partition data
according to mesh-division. Rewriting equation (45), we have

$$\tilde{y} = \mathbf{P}_M\, P(MN, N)\; [I_M \otimes F_N]\; P(MN, M)\; [I_N \otimes F_M]\; \mathbf{P}_M^{-1}\; \tilde{x}. \qquad (50)$$

We have seen transpose algorithms similar to the one in the above equation for row- or
column-division FFT algorithms. Each transpose algorithm requires $(2k - 1)$ communications. Now,
use the following equality (see equation (40)) to substitute $P(MN, M)$ in the above equation.

$$P(MN, M) = \mathbf{P}_M^{-1}\,[P(k, k_1) \otimes I_{MN/k}]\,[I_k \otimes P(MN/k, M/k_1)]\,\mathbf{P}_M \qquad (51)$$


Similarly, for $P(MN, N)$, we interchange the roles of $M$ and $N$, and $k_1$ and $k_2$ in equation (39).
We have

$$P(MN, N) = \mathbf{P}_M^{-1}\,[I_k \otimes P(MN/k, N/k_2)]\,[P(k, k_2) \otimes I_{MN/k}]\,\mathbf{P}_M.$$

Substitute the above equality into equation (50). Then, the 2D-FFT algorithm becomes

$$\tilde{y} = [I_k \otimes P(MN/k, N/k_2)]\,[P(k, k_2) \otimes I_{MN/k}]\,[\mathbf{P}_M\,(I_M \otimes F_N)\,\mathbf{P}_M^{-1}]\,[P(k, k_1) \otimes I_{MN/k}]\,[I_k \otimes P(MN/k, M/k_1)]\,[\mathbf{P}_M\,(I_N \otimes F_M)\,\mathbf{P}_M^{-1}]\; \tilde{x}.$$

Note that the terms $[P(k, k_2) \otimes I_{MN/k}]$ and $[P(k, k_1) \otimes I_{MN/k}]$ are dummy operations with
respect to implementation because these permutations represent an exchange of the entire data residing
at different nodes. This can be done by addressing processors according to the required
permutation instead of by data movement. Operations that start with $I_k$ are parallel operations with
no communication. For the remaining two terms that involve $\mathbf{P}_M$ and $\mathbf{P}_M^{-1}$, we can decompose
$\mathbf{P}_M$ and $\mathbf{P}_M^{-1}$ as follows:

$$\mathbf{P}_M = [I_{k_2} \otimes P(k_1^2, k_1) \otimes I_{MN/(k_1^2 k_2)}]\,[I_k \otimes P(N/k_2, k_1) \otimes I_{M/k_1}]$$

$$\mathbf{P}_M^{-1} = [I_k \otimes P(N/k_2, N/k) \otimes I_{M/k_1}]\,[I_{k_2} \otimes P(k_1^2, k_1) \otimes I_{MN/(k_1^2 k_2)}]$$

Each of $\mathbf{P}_M$ and $\mathbf{P}_M^{-1}$ has one communication stage and one local permutation stage. Each of
these communication stages transmits $(k_1 - 1)$ messages, with the size of each message being
$MN/(k_1^2 k_2)$. For the other dimension, each of $\mathbf{P}_M$ and $\mathbf{P}_M^{-1}$ will have $(k_2 - 1)$ communications
with the size of each message being $MN/(k_1 k_2^2)$. Therefore, the total number of communications
is reduced from $O(k_1 \cdot k_2)$ to $O(k_1 + k_2)$.


Experiments to measure the actual performance of the above 2D-FFT and the existing library routine on the Touchstone Delta machine have been carried out. The measurements are reported in Table A.4.2. The results shown in this table were measured with a library routine called dclock() that returns a double precision number. Calling this routine at the beginning and at the end of each of the algorithms, we obtained double precision times in milliseconds. These timings are purely for execution of the task, because the processors are not time-shared among multiple users. However, since each node executes in a slightly different time due to the underlying asynchronous communication network of the machine, we took the maximum of the times reported by all the nodes. We also averaged the timings over a set of one hundred experiments with forward and inverse two-dimensional transforms for each data size.

The performance of the two implementations is reported for execution on a 128-node and a 256-node machine. The data sizes that we have tested are listed in the first column of the table. The second and third columns give timings for the existing and new approaches, respectively, on the 128-node machine, while the fourth and fifth columns give the corresponding timings on the 256-node machine. It can be seen from the table that the performance gains of the new FFT are significant. We observed up to 600% performance improvement over the existing machine library.

                      128 nodes               256 nodes
  Dimensions       Old        New          Old        New
    M x N        (msecs)    (msecs)      (msecs)    (msecs)
   128 x  128    120.117     27.727      193.481     31.711
   256 x  128    120.151     31.234      192.980     35.017
   256 x  256    121.681     34.165      245.634     39.499
   512 x  128    125.425     34.401      210.761     35.865
   512 x  256    129.847     44.944      254.412     44.948
  1024 x  128    128.236     44.883      227.441     43.225
   512 x  512    125.901     60.946      270.365     56.096
  1024 x  256    133.562     64.331      262.051     53.420
  1024 x  512    152.919     99.989      285.066     76.041
  1024 x 1024    211.274    177.306      294.038    119.288

Table 4: Results on Intel's i860-based DELTA machine. Explanation: These results reflect the variations in communication structure between the "new" and "old" algorithms: the "new" algorithm requires 44 and 60 communications for the implementations on 128 (16 x 8) and 256 (16 x 16) processor systems, respectively, while the "old" algorithm requires 254 and 510 communications, respectively. However, it is to be noted that the reduction in the number of communications in the "new" algorithm is traded off against the size of the data being communicated.
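The communication counts quoted in the caption of Table 4 can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, the "new" count assumes one communication stage for each of the two mesh-dimension permutations and their inverses, and the "old" count assumes two transposes of (k - 1) messages each over all k = k1 * k2 nodes, which is the choice that matches the caption.

```python
def new_comm_count(k1, k2):
    """Messages per node for the mesh-division FFT on a k1 x k2 mesh.

    Assumption: the permutation and its inverse each contribute (k1 - 1)
    messages along one mesh dimension and (k2 - 1) along the other.
    """
    return 2 * ((k1 - 1) + (k2 - 1))


def old_comm_count(k1, k2):
    """Messages per node for the transpose-based FFT.

    Assumption: two global transposes, each exchanging data with the
    other (k - 1) nodes, where k = k1 * k2.
    """
    k = k1 * k2
    return 2 * (k - 1)


for k1, k2 in [(16, 8), (16, 16)]:
    print(k1 * k2, "nodes:", old_comm_count(k1, k2), "->", new_comm_count(k1, k2))
```

The counts grow with k1 * k2 in the old scheme and with k1 + k2 in the new one, while each message in the new scheme is larger.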

A.5  Related Work

Data organization is the key to successful parallelization of data parallel programs. As indicated in the introduction, there are two tracks of effort in data-partition and migration in distributed memory multiprocessors: automatic data-partitioning for general loop constructs as part of a compiler, and optimal partitioning for a specific algorithm. In this subsection, we briefly summarize the existing work in this field as it relates to the work presented in this paper. For a more comprehensive review of previous work in data-partitioning and redistribution, readers are referred to [3, 2, 5].

Ramanujam and Sadayappan [2] studied compile-time techniques for data-partitioning in distributed memory systems. They presented an analysis of communication-free partitions with a nice geometric demonstration. The research work performed by Li and Chen [6] focused on minimizing data movement among processors due to cross-references of multiple distributed arrays (alignment of multiple data structures). They have also presented a method of automatically generating efficient message-passing routines in parallel programs [6]. Gupta and Banerjee introduced the notion of constraints on data-partitioning to obtain good performance. In [9], a compiler algorithm was described that automatically finds optimal parallelism and optimal locality in general loop nests. All these studies aimed at optimizing data-partition and data alignment as part of a compiler. It is known that this optimization problem is NP-complete. A number of heuristics have been proposed [6, 7, 8, 2, 18, 1].

The use of tensor product notation to describe parallel algorithms has a long history beginning with Pease [19]. Johnson et al. [20] presented a comprehensive discussion on how to use tensor notation to design, modify and implement FFT algorithms on various computer architectures. Attempts to derive variants of FFT algorithms keeping the underlying architecture in mind have proven successful [10, 13]. Huang, Johnson and Johnson [21] have recently used tensor notation for formulating Strassen's matrix multiplication algorithm. Using the tensor representation, they derived three variant programs and compared their performance characteristics for shared memory multiprocessors.

Kaushik, Huang, Johnson and Sadayappan have proposed a very nice approach for data redistribution in distributed memory systems, which appeared recently in [5]. While their approach also utilizes tensor notation as a tool, our work differs in several aspects. First of all, our definitions are expressed in matrix form while theirs are in terms of indices (tensor bases). With their model one can estimate the communication cost of a computation precisely, while with our formulations one can easily manipulate the communication structures of a computation to achieve optimal performance. Deriving variants of an algorithm using our definitions is relatively simple because the data communication is easily visible. Secondly, all the definitions presented in [5], such as cyclic, block, and block cyclic, can be defined using our formulations as evidenced in Section 3, whereas some data-partitions, such as mesh-division, cannot be easily expressed using the notation in [5]. In addition, our representation acts directly on the data vector a(0 : N - 1) to achieve the required data-partition and migration scheme, while their representation presents ways to manipulate data indices from one distribution to the other (redistribution). Unlike their representation, we can embed our expressions for data distribution into an algorithm. As a result, global optimization of an application consisting of several computation modules becomes straightforward by just manipulating the algebraic expression at the interfaces between individual algorithms.

A.6  Conclusions

In this paper, we have presented a formal description of data-partition in distributed memory multiprocessors. Using the algebra of tensor products and stride permutations, different schemes of storing data in a distributed memory system are represented in a compact and systematic manner. The formalism of various data-partitioning schemes allows for immediate embedding of an algebraic expression into a computational algorithm. As a result, optimization of data-partition becomes simple tensor algebra manipulation. We have demonstrated the usefulness and significance of our formulations by considering applications. Experiments on existing distributed memory machines have been carried out. Numerical results show that significant performance gains are possible by using our formulations to generate variants of an algorithm tailored to specific system architectures.

B  Efficient Multidimensional DFT Module Implementation on the INTEL i860 Processor

Abstract 

In this paper, we present a unified implementation methodology for computing large one- and multi-dimensional Fourier transforms. By formulating various DFT algorithms in the language of tensor products, any large size Fourier transform is built up by a collection of small size DFT modules which include as parameters decimation step sizes and twiddle factors. These parameters are introduced in the DFT modules to take advantage of modern computer architectures with parallel, pipelined, multi-functional structures, while providing flexibility in the building blocks.

B.1  Introduction

Continuing our work [1] presented at ICSPAT'92, we have developed a unified implementation methodology for computing large one- and multi-dimensional Fourier transforms. The tensor product formulation of various DFT algorithms plays a central role in unifying the implementation by identifying a small number of computational cores and the necessary parameters. Our library of core computation modules has the following features.

•  We  have  efficiently  implemented  prime  factors  3,  5,  7,  11,  13,  17  as  well  as  powers  of 
2.  Thus,  transform  size  on  each  dimension  of  a  multi-dimensional  Fourier  transform  can 
have  factors  other  than  2. 

•  One-dimensional  small  modules  take  advantage  of  vector  operations  on  i860  by  looping 
on  other  factors  of  the  same  dimension  and  other  dimensions. 

•  One-dimensional  small  modules  have  pre-calculated  twiddle  factor  array  as  a  parameter. 
This  provides  for  intermediate  stages  of  Cooley-Tukey  FFT  implementation. 

It is widely believed that the data size in each dimension must be a power of two. In fact, a popular reference on numerical methods [2] recommends that if the data are defined over a period whose size is not a power of two, they should be padded with zeros up to the next power of two. In multi-dimensional DFT computation, this increases the transform size dramatically, not only slowing down the computation but also causing cache thrashing and memory overflow. In the case of the parallel computer iPSC/860, each node processor has 8 Mbytes of memory. If the size of the complex data to be processed is 72 x 72 x 72 = 373,248, the computation is carried out in the local memory of the processing unit without data segmentation. On the other hand, after padding with zeros, the size of the complex data to be processed will be 128 x 128 x 128 = 2,097,152, which is beyond the capacity of local memory; segmentation and loading data in and out will cause severe problems.
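The memory argument above can be checked with a short calculation. This is a sketch under one stated assumption that is not given in the text: each complex sample is taken to occupy 8 bytes (single-precision real and imaginary parts). The names BYTES_PER_COMPLEX, NODE_MEMORY and footprint are illustrative only.

```python
BYTES_PER_COMPLEX = 8          # assumption: single-precision complex samples
NODE_MEMORY = 8 * 2**20        # 8 Mbytes of memory per iPSC/860 node


def footprint(dims):
    """Return (number of complex points, bytes needed) for a data array."""
    points = 1
    for d in dims:
        points *= d
    return points, points * BYTES_PER_COMPLEX


for dims in [(72, 72, 72), (128, 128, 128)]:
    points, nbytes = footprint(dims)
    fits = nbytes <= NODE_MEMORY
    print(dims, points, "points,", nbytes, "bytes, fits in node memory:", fits)
```

Under this assumption the unpadded array takes just under 3 Mbytes, while the zero-padded array takes 16 Mbytes, twice the memory of a node.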

In this paper, we describe an implementation strategy for efficient multi-dimensional DFT routines on the Intel i860 processor. Timing results for some sample medium size 2-dimensional DFT modules with prime factors in each dimension are provided. The results of a comparable, commercially available power-of-2 FFT package [6] are also included.


B.2  Tensor  Product  Formulation 


The tensor product presentation of fast Fourier transform algorithms dates back to Pease's paper [7] of 1968. Its role in applications has varied since then, from that of a notational convenience for describing a complex algorithm to that of an interactive programming tool. A detailed discussion of tensor product identities can be found in [8]. In this paper, we emphasize the tensor product as a programming tool in DFT module implementation. The parameters that govern the data permutation, vector segmentation, and an algorithm's granularity and parallelism come naturally from the tensor product formulation of various algorithms.

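The one-dimensional $N$-point Fourier transform of an array $x$ is defined as

$$y = F(N)\,x, \qquad (1)$$

where $F(N)$ is the $N \times N$ matrix defined by

$$F(N) = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & w & \cdots & w^{N-1} \\ \vdots & \vdots & & \vdots \\ 1 & w^{N-1} & \cdots & w^{(N-1)^2} \end{bmatrix}, \qquad (2)$$

where $w = e^{2\pi i/N}$.

The $N_1 \times N_2$ 2-dimensional Fourier transform of $X$, denoted by

$$Y = F(N_1, N_2)\,X, \qquad (3)$$

can be written in matrix form as

$$Y = F(N_1)\,X\,F(N_2), \qquad (4)$$

where $X$ and $Y$ are the $N_1 \times N_2$ 2-dimensional input and output arrays, respectively.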

Denote by $x$ the vector in $C^N$, $N = N_1 N_2$, formed by reading in order down the columns of $X$, and by $y$ the vector formed the same way from $Y$. We can write the 2-dimensional Fourier transform in a tensor product format:

$$y = (F(N_2) \otimes F(N_1))\,x. \qquad (5)$$

(5) can be factored as

$$y = (F(N_2) \otimes I_{N_1})(I_{N_2} \otimes F(N_1))\,x. \qquad (6)$$

(6) is usually referred to as the row-column method: $I_{N_2} \otimes F(N_1)$ computes on the rows, and $F(N_2) \otimes I_{N_1}$ computes on the columns.
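The factorization of (5) into (6) can be checked numerically on a small case. The following is a minimal sketch using only the Python standard library and dense matrices; the helper names (dft, eye, kron, matmul, max_diff) are illustrative, and the sign convention w = exp(2*pi*i/N) is an assumption of this sketch.

```python
import cmath


def dft(n):
    """Dense DFT matrix F(n) with entries w**(j*k), w = exp(2*pi*i/n) (assumed sign)."""
    w = cmath.exp(2j * cmath.pi / n)
    return [[w ** ((j * k) % n) for k in range(n)] for j in range(n)]


def eye(n):
    """Identity matrix I_n."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def kron(a, b):
    """Tensor (Kronecker) product of two dense matrices."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def matmul(a, b):
    """Plain dense matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def max_diff(a, b):
    """Largest entrywise difference between two matrices of equal shape."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


N1, N2 = 3, 4
full = kron(dft(N2), dft(N1))                                        # form (5)
row_col = matmul(kron(dft(N2), eye(N1)), kron(eye(N2), dft(N1)))     # form (6)
err_factor = max_diff(full, row_col)

# Compare with the matrix form Y = F(N1) X F(N2) on a small test array.
X = [[complex(j1 + 1, j2) for j2 in range(N2)] for j1 in range(N1)]
Y = matmul(matmul(dft(N1), X), dft(N2))
x = [[X[j1][j2]] for j2 in range(N2) for j1 in range(N1)]            # read down the columns
vec_Y = [[Y[j1][j2]] for j2 in range(N2) for j1 in range(N1)]
err_vec = max_diff(matmul(full, x), vec_Y)

print(err_factor, err_vec)
```

Both differences should be at the level of floating-point rounding.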

The tensor product formulation of the 2-dimensional Fourier transform in (5) provides a general format for multi-dimensional Fourier transforms. The $K$-dimensional Fourier transform of an array $X$ of size $N_1 \times N_2 \times \cdots \times N_K$ is denoted by

$$Y = F(N_1, N_2, \ldots, N_K)\,X. \qquad (7)$$

Denote by $x$ the vector in $C^N$, $N = N_1 N_2 \cdots N_K$, formed by reading in order down the columns of $X$ along the $N_1$ dimension, then the $N_2$ dimension, and so on up to the $N_K$ dimension, and by $y$ the vector formed the same way from $Y$. We can write the multi-dimensional Fourier transform of (7) in a tensor product format:

$$y = (F(N_K) \otimes \cdots \otimes F(N_2) \otimes F(N_1))\,x. \qquad (8)$$

(8) can be factorized into $K$ stages of Fourier transform computation:

$$y = (F(N_K) \otimes I_{N_{K-1} \cdots N_1}) \cdots (I_{N_K \cdots N_3} \otimes F(N_2) \otimes I_{N_1})(I_{N_K \cdots N_2} \otimes F(N_1))\,x. \qquad (9)$$

Every stage of (9) is of the form

$$I_L \otimes F(M) \otimes I_S. \qquad (10)$$

The structure of (10) suggests a unified implementation methodology for the multi-dimensional Fourier transform using a set of one-dimensional DFT modules with parameters $L$ and $S$: $S$ determines the stride permutation; $L$ determines the looping.

The tensor product formulation of the multi-dimensional Fourier transform in (9) is exactly the row-column method of multi-dimensional Fourier transform computation. The modular implementation of (10) immediately suggests an efficient way of taking advantage of the parallel and vector architectures of the target computer system. The stride parameter replaces the global permutation after each stage of DFT computation; the looping parameter replaces calling the same subroutine many times.
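To make the roles of the loop parameter L and the stride parameter S in (10) concrete, here is a minimal sketch of such a module in plain Python. It is not the library module described in this paper; apply_module, dft_stages and dft_nd_direct are hypothetical names, and the sign convention w = exp(2*pi*i/N) is an assumption.

```python
import cmath


def dft(n):
    """Dense DFT matrix F(n), w = exp(2*pi*i/n) (assumed sign convention)."""
    w = cmath.exp(2j * cmath.pi / n)
    return [[w ** ((j * k) % n) for k in range(n)] for j in range(n)]


def apply_module(x, L, M, S, F):
    """Compute (I_L (x) F (x) I_S) x for an M x M matrix F.

    L controls the outer looping; S is the stride at which the M inputs
    of each small transform are gathered, so no global permutation is needed.
    """
    y = [0j] * len(x)
    for l in range(L):
        for s in range(S):
            base = l * M * S + s
            seg = [x[base + m * S] for m in range(M)]
            for j in range(M):
                y[base + j * S] = sum(F[j][m] * seg[m] for m in range(M))
    return y


def dft_stages(x, dims):
    """Apply the K stages of (9); dims = [N1, ..., NK], N1 varying fastest in x."""
    total, S = len(x), 1
    for n in dims:
        x = apply_module(x, total // (S * n), n, S, dft(n))
        S *= n
    return x


def dft_nd_direct(x, dims):
    """Reference: evaluate the multi-dimensional DFT sum index by index."""
    def unflatten(idx):
        out = []
        for n in dims:
            out.append(idx % n)
            idx //= n
        return out
    y = []
    for k in range(len(x)):
        ks = unflatten(k)
        acc = 0j
        for j in range(len(x)):
            phase = sum(a * b / n for a, b, n in zip(unflatten(j), ks, dims))
            acc += x[j] * cmath.exp(2j * cmath.pi * phase)
        y.append(acc)
    return y


dims = [2, 3, 2]
x = [complex(i % 5, -i) for i in range(12)]
err = max(abs(a - b) for a, b in zip(dft_stages(x, dims), dft_nd_direct(x, dims)))
print(err)
```

The same routine serves every stage; only (L, M, S) change from stage to stage.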


B.3  Cooley-Tukey  FFT  Algorithms 

Suppose $N = LM$. The Cooley-Tukey algorithm (decimation-in-frequency) for the one-dimensional Fourier transform is given by the tensor product factorization

$$F(N) = P(N, M)\,(I_L \otimes F(M))\,T_M(N)\,(F(L) \otimes I_M), \qquad (11)$$

where $P(N, M)$ is the $N \times N$ stride-$M$ permutation matrix and $T_M(N)$ is the $N \times N$ block diagonal matrix of twiddle factors,

$$T_M(N) = \bigoplus_{l=0}^{L-1} D_M^{\,l}(N), \qquad (12)$$

where

$$D_M(N) = \mathrm{diag}(1, w, \ldots, w^{M-1}). \qquad (13)$$
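The factorization (11) can be verified numerically by applying its four factors from right to left. The sketch below makes the following assumptions: w = exp(2*pi*i/N), and P(N, M) gathers the input at stride M, that is, (P(N, M) x)[r*L + q] = x[q*M + r]. The function names are illustrative and are not the modules of this paper.

```python
import cmath


def dft(n):
    """Dense DFT matrix F(n), w = exp(2*pi*i/n) (assumed sign convention)."""
    w = cmath.exp(2j * cmath.pi / n)
    return [[w ** ((j * k) % n) for k in range(n)] for j in range(n)]


def apply_module(x, L, M, S, F):
    """Compute (I_L (x) F (x) I_S) x for an M x M matrix F."""
    y = [0j] * len(x)
    for l in range(L):
        for s in range(S):
            base = l * M * S + s
            for j in range(M):
                y[base + j * S] = sum(F[j][m] * x[base + m * S] for m in range(M))
    return y


def twiddle(x, L, M):
    """Multiply by T_M(N): block l is diag(1, w, ..., w**(M-1)) raised to the power l."""
    w = cmath.exp(2j * cmath.pi / (L * M))
    return [w ** (l * m) * x[l * M + m] for l in range(L) for m in range(M)]


def stride_perm(x, M):
    """y = P(N, M) x: read x at stride M (assumed definition of the stride permutation)."""
    L = len(x) // M
    return [x[q * M + r] for r in range(M) for q in range(L)]


def cooley_tukey(x, L, M):
    """Evaluate the right-hand side of (11), rightmost factor first."""
    x = apply_module(x, 1, L, M, dft(L))      # F(L) (x) I_M
    x = twiddle(x, L, M)                      # T_M(N)
    x = apply_module(x, L, M, 1, dft(M))      # I_L (x) F(M)
    return stride_perm(x, M)                  # P(N, M)


def dft_direct(x):
    """Reference: multiply by the full matrix F(N)."""
    F = dft(len(x))
    return [sum(F[k][n] * x[n] for n in range(len(x))) for k in range(len(x))]


x = [complex(n, -(n % 3)) for n in range(40)]
err = max(abs(a - b) for a, b in zip(cooley_tukey(x, 8, 5), dft_direct(x)))
print(err)
```

The test case N = 40 with L = 8 and M = 5 is the same splitting used for the 40 x 40 example later in this paper.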

The Cooley-Tukey FFT algorithm given in (11) can be used in an inductive argument to derive an extension to many factors. Suppose

$$N = N_1 N_2 \cdots N_K. \qquad (14)$$

Set $N(0) = 1$ and

$$N(k) = N_1 N_2 \cdots N_k, \quad 1 \le k \le K, \qquad (15)$$

$$N'(k) = N/N(k), \quad 0 \le k \le K. \qquad (16)$$

Define

$$F_k = T_k\,(I_{N(k-1)} \otimes F(N_k) \otimes I_{N'(k)}), \qquad (17)$$

where $T_k$ is the diagonal matrix

$$T_k = I_{N(k-1)} \otimes T_{N'(k)}(N'(k-1)). \qquad (18)$$

Then we have the Cooley-Tukey FFT algorithm for many factors:

$$F(N) = Q\,F_K \cdots F_2\,F_1, \qquad (19)$$

where $Q$ is the generalized bit-reversal permutation matrix.

Each stage $F_k$, $1 \le k \le K$, of the Fourier transform, with the twiddle factor multiplication, can be written as

$$I_{N(k-1)} \otimes \big(T_{N'(k)}(N'(k-1))\,(F(N_k) \otimes I_{N'(k)})\big). \qquad (20)$$

(20) can be implemented as a module

$$I_{N(k-1)} \otimes \Big[\bigoplus_{j=0}^{N_k-1} D^{\,j}_{N'(k)}(N'(k-1))\,(F(N_k) \otimes I_{N'(k)})\Big]. \qquad (21)$$

The parameters of this module are $N'(k-1)$, $N(k-1)$ and the twiddle factors $D_{N'(k)}(N'(k-1))$.

Although the form (21) does not look as neat as (10), the implementation is just as easy. The twiddle factors, which vary with the stride parameter, are introduced into the module. Thus any large size Fourier transform computation is carried out by putting together a set of modules given in (10) and (21).

B.4  Multi-Dimensional  FFT  Algorithms 

In  this  section,  we  will  show  that  various  multi-dimensional  DFT  algorithms  can  be  unified  into 
the  format  described  in  the  previous  sections:  they  can  be  decomposed  into  identifiable  basic 
building  blocks  of  small  size  modules. 

Row-Column Method

Consider the 2-dimensional Fourier transform of (5). The row-column method of computing $y$ is written as:

$$y = (F(N_2) \otimes I_{N_1})(I_{N_2} \otimes F(N_1))\,x. \qquad (22)$$

Suppose $N_1 = L_1 M_1$ and $N_2 = L_2 M_2$. Using the Cooley-Tukey FFT algorithm of (11) in (22), we have

$$\big((P(N_2, M_2)(I_{L_2} \otimes F(M_2))\,T_{M_2}(N_2)\,(F(L_2) \otimes I_{M_2})) \otimes I_{N_1}\big) \times \big(I_{N_2} \otimes (P(N_1, M_1)(I_{L_1} \otimes F(M_1))\,T_{M_1}(N_1)\,(F(L_1) \otimes I_{M_1}))\big). \qquad (23)$$

The implementation of the 2-dimensional DFT in (23) has the same structure as the 1-dimensional FFT in (19). Two sets of DFT modules are computed; one with twiddle factor multiplications,

Ln—\ 

®  4.)].  (24) 

i=o 

i  =  0,1,  0  <  i  <  Mi,  and  Ln  and  Lm  are  the  parameters  controlling  the  decimation  and 
looping,  and  twiddle  factor  parameters  come  from  Di.{Ni); 

The other module is without twiddle factors:

$$I_{L_k} \otimes F(M_i) \otimes I_{L_l}, \qquad (25)$$

where $L_k$ and $L_l$ are the parameters of the module.


Vector-Radix  Method 

The vector-radix Cooley-Tukey FFT algorithm to compute (22) is given by the following factorization:

$$F(N_2) \otimes F(N_1) = P\,F_2\,T\,F_1, \qquad (26)$$

where

$$T = T_{M_2}(N_2) \otimes T_{M_1}(N_1), \qquad (27)$$

$$P = P(N_2, M_2) \otimes P(N_1, M_1), \qquad (28)$$

$$F_1 = F(L_2) \otimes I_{M_2} \otimes F(L_1) \otimes I_{M_1} = (F(L_2) \otimes I_{M_2 N_1})(I_{L_2 M_2} \otimes F(L_1) \otimes I_{M_1}), \qquad (29)$$

$$F_2 = I_{L_2} \otimes F(M_2) \otimes I_{L_1} \otimes F(M_1) = (I_{L_2} \otimes F(M_2) \otimes I_{L_1 M_1})(I_{L_2 M_2 L_1} \otimes F(M_1)). \qquad (30)$$

(26) can be computed by using the modules without twiddle factors, a separate stage of stride permutation $P$, and a twiddle factor multiplication stage $T$.

B.5  Implementation  on  Intel  i860  Processor 

In this section, we give an example of carrying out the computation of a multi-dimensional DFT using our tensor product modules. Take the case of the 40 x 40 2-dimensional Fourier transform. Set 40 = 5 x 8. The tensor product form of the Cooley-Tukey FFT algorithm (row-column method) is

$$F(40, 40) = \big((P(40, 5)(I_8 \otimes F(5))\,T_5(40)\,(F(8) \otimes I_5)) \otimes I_{40}\big) \times \big(I_{40} \otimes (P(40, 5)(I_8 \otimes F(5))\,T_5(40)\,(F(8) \otimes I_5))\big). \qquad (31)$$

Variants can be derived from (31). One of them is

$$(P(40, 5) \otimes I_{40})(I_8 \otimes F(5) \otimes I_{40})\big((T_5(40)(F(8) \otimes I_5)) \otimes I_{40}\big) \times (I_{40} \otimes P(40, 5))(I_{320} \otimes F(5))\big(I_{40} \otimes (T_5(40)(F(8) \otimes I_5))\big). \qquad (32)$$

Both forms have their advantages. For the Intel i860 processor, algorithm (31) gives rise to a faster implementation because it minimizes cache thrashing. The implementation of (31) is given as:

c     transform on the columns
      do i=0,39
         call ftc8tw( x(0,i), 5, 1, 1, w, isign )
         call ftc5( x(0,i), y(0,i), 1, 8, 1, isign )
         call transpose(y, x)
      end do
c     transform on the rows

The implementation of (32) is given as:

c     transform on the columns
      call ftc8tw( x, 5, 1, 40, w, isign )
      call ftc5( x, y, 1, 8, 40, isign )
c     transform on the rows
      call ftc8tw( y, 5*40, 40, 1, w, isign )
      call ftc5( y, x, 40, 8*40, 1, isign )

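The module ftc8tw computes

$$I_{n_1} \otimes \Big[\bigoplus_{i=0}^{7} D^{\,i}_5(40)\,(F(8) \otimes I_{n_2})\Big], \qquad (33)$$

where isign denotes the forward or reverse transform and w denotes the pre-calculated twiddle factors, and the module ftc5 computes

$$I_{n_1} \otimes F(5) \otimes I_{n_2}. \qquad (34)$$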

The timing results for some of the one- and 2-dimensional Fourier transforms are given in Tables 1 and 2. They are compared to the Kuck and Associates, Inc. Math Library Package on the Intel iPSC/860. It is worth mentioning that Intel's 1-dimensional and 2-dimensional FFT routines are hand-coded assembly programs, while the AwareTime routines are hybrids of Fortran calls and i860 hand-coded assembly modules.

Table 1.  Timing Results on i860 Processor (1-D)

  FT Size N    AwareTime ms.    IntelTime[6] ms.
      3          0.000363
      4          0.000449          0.0119
      5          0.000881
      7          0.00164
      8          0.00139           0.0141
     16          0.007             0.0191
     20          0.0098
     32          0.0133            0.031
     40          0.0211
     64          0.028             0.065
     80          0.0528
    384          0.296
    512          0.350             0.560

ms. = 10^{-3} second.


Table 2.  Timing Results on i860 Processor (2-D)

  FT Size n x n    AwareTime    IntelTime
    32 x 32        1.386 ms.
    40 x 40        2.400 ms.
    64 x 64        6.65 ms.
    80 x 80        12.9 ms.
   128 x 128       24.7 ms.
   160 x 160       58.6 ms.
   256 x 256       137 ms.
   384 x 384       296 ms.
   512 x 512       777 ms.

C  A New Approach for Computing Multi-dimensional DFTs on Parallel Machines and its Implementation on the iPSC/860 Hypercube


Abstract 

In this paper we propose a new approach for computing multi-dimensional DFTs that reduces interprocessor communications and is therefore suitable for efficient implementation on a variety of multiprocessor machines. Group theoretic concepts are used to formulate a computational strategy that combines the Reduced Transform Algorithm (RTA) with the Good-Thomas factorization. The RTA is employed not as a data processing tool but rather as a book-keeping tool, in order to decompose the problem into many smaller sub-problems that can be solved independently. Implementation issues on an Intel iPSC/860 hypercube are discussed and timing results are provided for many different cases. The non-optimized realizations of the new approach are shown to outperform the highly optimized realizations of the traditional row-column method in a variety of test cases.


C.1  Introduction-Motivation

Parallel computing presents a new environment for algorithm design and implementation, along with new challenges to the computational scientist. The performance of any given program depends on a larger number of parameters than in the serial case, which widens the gap between theoretical models and practical experience.

In this paper, we present a strategy for computing a multidimensional DFT that combines a relatively new algorithm (the Reduced Transform Algorithm) with already implemented single processor kernel routines. We will use the reduced transform algorithm to address the reduction and optimization of interprocessor communications. Our work has been mainly motivated by the distributed memory parallel computing paradigm, which is arguably the most difficult to harness because its interprocessor communication is exposed to the programmer. Most parallel computers require sophisticated algorithms and programming techniques for their optimum utilization. In this discussion, we will make use of algebraic facts in presenting the algorithms. The parameters in algebraic formulas give us the important implementation parameters. Thus the flexibility to address the variables in implementations is equated with flexibility in manipulating the algebraic formalism. An initial investment in familiarity with some amount of algebra may be necessary, but the payoff is immediate. Most of the relevant algebra, not in its most rigorous form but in its usage, can be found in [1].

In its most general form, the Reduced Transform Algorithm (RTA) is a full utilization of the duality between periodic and decimated data in the Fourier transform. This duality was used partially in some algorithms and implementations for restricted cases [2, 3, 4, 5]. A description of a generalization in a unified setting is found in [6, 7] along with the work of M. Rofheart [8]. In this paper, we will consider the application of the RTA to the case Z/P x Z/P, for a prime number P. The tensor product formulation of DFT computation on Z/N x Z/P x Z/P is interleaved with the periodization step in the RTA for Z/P x Z/P to produce P + 1 independent data sets of size NP.

We will use the RTA to address the imbalance between computation and communication rates in current distributed memory parallel machines by reducing communication between processors to collective patterns only (broadcast and combine) instead of the all-to-all communication patterns required in the global matrix transpose needed by the row-column (RC) implementations of multidimensional DFTs. Also, since fast algorithms for prime size 1D DFTs exist [1] and the case Z/P x Z/P of the RTA is very efficient because its computation requires only P + 1 1D transforms (versus 2P for the row-column method), our approach addresses the issue of storage reduction by providing additional transform size options. For example, the ability to perform a 181 x 181 point 2D DFT means potential storage savings of up to 50% over the 256 x 256 case, along with the savings in computational time. The storage savings can be used for the optimization of the broadcasting step needed for the RTA, in environments with long communication latency.
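The two figures quoted above, the storage saving and the number of 1D transforms, follow from simple arithmetic. The short sketch below only restates that arithmetic; the function names are hypothetical.

```python
def storage_saving(n_exact, n_padded):
    """Fraction of storage saved by an n_exact x n_exact array over an n_padded x n_padded one."""
    return 1.0 - (n_exact * n_exact) / (n_padded * n_padded)


def transform_counts(P):
    """Number of P-point 1D transforms: (RTA on Z/P x Z/P, row-column method)."""
    return P + 1, 2 * P


saving = storage_saving(181, 256)
rta, row_column = transform_counts(181)
print(round(100 * saving, 1), "% storage saved;", rta, "vs", row_column, "1D transforms")
```

For P = 181 the RTA needs 182 one-dimensional transforms where the row-column method needs 362.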

Via the Chinese remainder theorem, we will extend our method to compute the 3-dimensional DFT on Z/N x Z/MP x Z/KP, where N is an arbitrary integer, and M and K are integers not divisible by P, for a prime P. We transform the data set to an equivalent 5D data set on Z/N x Z/M x Z/K x Z/P x Z/P, and then employ the RTA on the last two indices to break the problem into smaller independent sub-problems that can be computed in parallel. Each sub-problem is associated with the computation of the values of the Fourier transform along one line in the set Z/P x Z/P passing through the origin. These lines intersect only at the origin and cover the index space. When translated from the 5D data set back to the original 3D data, each line corresponds to a set of parallel lines covering the index space.
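The re-indexing from Z/MP to Z/M x Z/P rests on the Chinese remainder theorem, which requires M and P to be coprime. The following minimal sketch shows one standard form of the index map and its inverse; it is illustrative only, the function names are hypothetical, and the exact input and output maps used by the Good-Thomas factorization in this paper may differ.

```python
from math import gcd


def to_pair(n, M, P):
    """Map an index n in Z/MP to the pair (n mod M, n mod P)."""
    return n % M, n % P


def from_pair(a, b, M, P):
    """Inverse map, built from the CRT idempotents (needs gcd(M, P) == 1)."""
    if gcd(M, P) != 1:
        raise ValueError("M and P must be coprime")
    e_m = P * pow(P, -1, M)    # congruent to 1 mod M and 0 mod P (Python 3.8+)
    e_p = M * pow(M, -1, P)    # congruent to 0 mod M and 1 mod P
    return (a * e_m + b * e_p) % (M * P)


M, P = 4, 3
pairs = [to_pair(n, M, P) for n in range(M * P)]
print(pairs)
print([from_pair(a, b, M, P) for a, b in pairs])
```

Every index 0, ..., MP - 1 receives a distinct pair, so a 1D array of length MP can be viewed as an M x P array without moving data.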

Three stages are needed to compute the values of the DFT along the lines: (1) a periodization stage, which consists of additions of data along lines perpendicular to a given line, (2) a 3D Cooley-Tukey FFT and (3) a P-point DFT. In a multiprocessor environment, each processor computes these three steps independently of the others, thus allowing for maximum parallelism and efficiency. Moreover, the final data distribution among the processors is such as to permit further processing in a parallel fashion, since every processor holds only results belonging to the same geometrical subset.

The proposed hybrid method (HRTA) can be used in applications such as the computation of motion from a sequence of images (multi-frame detection, MFD), a very important task in computer vision, HDTV and video telephony. Several methods for MFD have been proposed in the literature. They are usually divided into two categories: Time Domain methods, which estimate the motion by processing the sequence of images directly, and the recently proposed Frequency Domain methods [9], [10], which process the frequency contents of the images to estimate the velocity and trajectory of the moving components. The latter methods offer more robust detection and huge computational savings, since the frequency domain representation of the 3D data (sequence of 2D images) is more compact than the equivalent time domain representation. With all the processors holding data belonging to different lines in the frequency domain, each processor can independently test for the presence of motion along its assigned direction.

This paper is organized as follows: In section 2 we describe the RTA with an application on Z/N x Z/P x Z/P and its parallel processing strategy. In section 3 we discuss the extension via the Chinese remainder theorem and introduce the hybrid algorithm (HRTA) that we use on Z/N x Z/MP x Z/KP, and its parallel variant. In section 4 we discuss issues related to the implementation of the hybrid algorithm on the Intel iPSC/860 parallel machine. In section 5 we present detailed timing results and a thorough comparison of our approach with the traditional row-column method for a variety of 2D and 3D DFT cases. We close the paper in section 6 by summarizing our findings and proposing directions for further investigation.

C.2  The  Reduced  Transform  Algorithm  (RTA)  on  Z/P  x  Z/P 

Before  we  proceed  we  need  the  following  definitions: 

Let $G$ be an abelian group of the form

$$G = Z/N_1 \times Z/N_2 \times \cdots \times Z/N_R.$$

For $g, h \in G$, define the bilinear map from $G \times G$ to $C^*$, the complex numbers of magnitude 1, by

$$\chi(g, h) = e^{2\pi i g_1 h_1/N_1} \cdots e^{2\pi i g_R h_R/N_R}, \qquad (1)$$

where $g = (g_1, g_2, \ldots, g_R)$, $h = (h_1, h_2, \ldots, h_R)$. Since $g_r h_r$, $1 \le r \le R$, is uniquely defined in $Z/N_r$, (1) is well defined. For a subgroup $S$ of $G$, the dual of $S$, denoted by $S^\perp$, is the following subgroup of $G$:

$$S^\perp = \{g \in G : \chi(g, s) = 1 \text{ for all } s \in S\}.$$

In addition to duality, we will use the following definition. Let $S$ be a subgroup of an abelian group $G$. A subgroup $S^c$ of $G$ is called a complementary subgroup of $S$ if every element $g \in G$ can be written uniquely as

$$g = s + c, \quad s \in S, \; c \in S^c.$$

In general, the complementary subgroup is not unique. Moreover, not every subgroup has a complementary subgroup.

Let $G = Z/P \times Z/P$, where $P$ is a prime number. Non-trivial subgroups of $G$ are of order $P$,
and hence cyclic. In addition, every subgroup of $G$ has a complementary subgroup. The Reduced
Transform Algorithm (RTA) on $G$ proceeds as follows:

1. Determine output decimating subgroups to cover $G$:

For $0 \le l < P$ set:

$$P_l = \{ a(1,l) : 0 \le a \le P-1 \},$$

and for $l = P$ set:

$$P_P = \{ (0,a) : 0 \le a \le P-1 \}.$$

We have

$$\bigcup_{l=0}^{P} P_l = G.$$

2. Determine the input periodizing subgroups, for $0 \le l \le P$.

Denote by $Q_l$, $0 \le l < P$, the following subgroups of $G$:

$$Q_l = \{ b(-l,1) : 0 \le b < P \}.$$

Also for $l = P$

$$Q_P = \{ b(1,0) : 0 \le b < P \}.$$

$Q_l$, $0 \le l \le P$, is a subgroup of order $P$ and $Q_l = P_l^\perp$.

In Figure 1 we show the output decimating subgroups $P_l$ for $G = Z/P \times Z/P$, $P = 3$ (area
inside the box). If we extend the index space, each $P_l$ corresponds to a "line". Due to the modulo
$P$ operations certain points of a line outside the box (marked with a circled +) will be mapped
inside (to the corresponding circled node with * in the same row/column). Also note that due to
the periodicity, the two lines labeled $P_2$ are actually the same. All lines intersect at the origin. In
the same figure we show the input periodizing subgroups $Q_l$. The collection $Q_l$ of input lines covers
the whole index space, as the collection $P_l$ of output lines does, and the two collections are dual
to each other.

Figure 1: The output decimating subgroups (lines) $P_l$ and the input periodizing subgroups (lines)
$Q_l$ for the case $P = 3$.

3. Compute the periodizations.

A periodization is completely determined by its values on a complementary group. Fix a
complementary subgroup for $Q_l$, $0 \le l \le P$, and denote it by $Q_l^c$.

$$g_l(c) = \sum_{b \in Q_l} f(c+b), \quad 0 \le l \le P, \ c \in Q_l^c.$$

Although there are many choices for complementary subgroups, we fix them to be:

$$Q_l^c = \{ (c,0) : 0 \le c \le P-1 \}, \quad 0 \le l < P,$$

$$Q_P^c = \{ (0,c) : 0 \le c \le P-1 \}, \quad l = P.$$

4. Compute the DFT. For $a \in P_l$,

$$\hat f(a) = \sum_{c \in Q_l^c} \sum_{b \in Q_l} f(c+b)\,\chi(c+b,a). \qquad (2)$$

Since $\chi(b,a) = 1$, using (2) we get

$$\hat f(a) = \sum_{c \in Q_l^c} g_l(c)\,\chi(c,a). \qquad (3)$$

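For $0 \le l \le P-1$, we will use the following identification to index the computations:

$$a \leftrightarrow (a, al), \quad b \leftrightarrow (-bl, b), \quad c \leftrightarrow (c, 0),$$

$$0 \le a, b, c < P, \quad a \in P_l, \ b \in Q_l, \ c \in Q_l^c.$$

For $l = P$, the identification is:

$$a \leftrightarrow (0,a), \quad b \leftrightarrow (b,0), \quad c \leftrightarrow (0,c),$$

$$0 \le a, b, c < P, \quad a \in P_l, \ b \in Q_l, \ c \in Q_l^c.$$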

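Therefore we can rewrite (2) and (3) as follows.

$$g_l(c) = \sum_{b=0}^{P-1} f(c - bl, b), \quad 0 \le l < P, \ 0 \le c < P$$

$$g_P(c) = \sum_{b=0}^{P-1} f(b, c), \quad 0 \le c < P$$

$$\hat f(a, al) = \sum_{c=0}^{P-1} g_l(c)\, e^{-2\pi i ac/P}, \quad 0 \le l < P, \ 0 \le a < P$$

$$\hat f(0, a) = \sum_{c=0}^{P-1} g_P(c)\, e^{-2\pi i ac/P}, \quad 0 \le a < P$$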
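
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `rta_2d` and `dft_direct` are hypothetical, the input is a P x P nested list with P prime, and the P-point DFTs are evaluated straight from the definition rather than with an optimized kernel.

```python
import cmath

def dft_direct(f, P):
    """Reference 2D DFT on Z/P x Z/P, straight from the definition."""
    w = cmath.exp(-2j * cmath.pi / P)
    return [[sum(f[x][y] * w ** ((u * x + v * y) % P)
                 for x in range(P) for y in range(P))
             for v in range(P)] for u in range(P)]

def rta_2d(f, P):
    """2D DFT on Z/P x Z/P (P prime), computed one output line at a time."""
    w = cmath.exp(-2j * cmath.pi / P)
    F = [[0j] * P for _ in range(P)]
    for l in range(P + 1):
        # Periodization over the input line Q_l.
        if l < P:
            g = [sum(f[(c - b * l) % P][b] for b in range(P)) for c in range(P)]
        else:
            g = [sum(f[b][c] for b in range(P)) for c in range(P)]
        # A P-point DFT of g gives the values on the output line P_l.
        for a in range(P):
            value = sum(g[c] * w ** ((a * c) % P) for c in range(P))
            if l < P:
                F[a][(a * l) % P] = value   # output point (a, a*l)
            else:
                F[0][a] = value             # output point (0, a)
    return F
```

Each pass of the loop over `l` touches only its own line, which is what allows one line per processor. The origin lies on every line, so it is computed P + 1 times with the same value.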

C.2.1  Application:  The  case  A  =  Z/N  x  Z/P  x  Z/P

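Let $A = Z/N \times Z/P \times Z/P$, for a natural number $N$ and a prime number $P$. For $f \in L(A)$ and
$(u, v, w) \in A$, the Fourier transform, $\hat f$, is defined by

$$\hat f(u,v,w) = \sum_{z=0}^{P-1} \sum_{y=0}^{P-1} \sum_{x=0}^{N-1} f(x,y,z)\, e^{-2\pi i ux/N} e^{-2\pi i vy/P} e^{-2\pi i wz/P}.$$

For $a \in P_l$, $0 \le l \le P$,

$$\hat f(u,a) = \sum_{c \in Q_l^c} \sum_{b \in Q_l} \sum_{x=0}^{N-1} f(x, c+b)\, e^{-2\pi i ux/N}\, \omega^{\langle a, c+b \rangle},$$

where $\omega = e^{-2\pi i/P}$, and $\langle a, c \rangle$ corresponds to the usual inner product. Changing the order of
summation and using $\omega^{\langle a, b \rangle} = 1$ for $b \in Q_l$,

$$\hat f(u,a) = \sum_{c \in Q_l^c} \sum_{x=0}^{N-1} \Big( \sum_{b \in Q_l} f(x, c+b) \Big) e^{-2\pi i ux/N}\, \omega^{\langle a, c \rangle},$$

or equivalently, with $g_l(x,c) = \sum_{b \in Q_l} f(x, c+b)$,

$$\hat f(u,a) = \sum_{c \in Q_l^c} \sum_{x=0}^{N-1} g_l(x,c)\, e^{-2\pi i ux/N}\, \omega^{\langle a, c \rangle}.$$

This computation can also be rewritten, using our identification scheme, as follows:
For $0 \le l \le P-1$,

$$\hat f(u, a, al) = \sum_{c=0}^{P-1} \sum_{x=0}^{N-1} g_l(x,c)\, e^{-2\pi i ux/N} e^{-2\pi i ac/P}
= \sum_{c=0}^{P-1} \sum_{x=0}^{N-1} \Big( \sum_{b=0}^{P-1} f(x, c - bl, b) \Big) e^{-2\pi i ux/N} e^{-2\pi i ac/P},$$

and for $l = P$,

$$\hat f(u, 0, a) = \sum_{c=0}^{P-1} \sum_{x=0}^{N-1} g_P(x,c)\, e^{-2\pi i ux/N} e^{-2\pi i ac/P}
= \sum_{c=0}^{P-1} \sum_{x=0}^{N-1} \Big( \sum_{b=0}^{P-1} f(x, b, c) \Big) e^{-2\pi i ux/N} e^{-2\pi i ac/P}.$$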

In  Figure  2  we  depict  the  three-dimensional  index  set  A  =  Z/N  x  Z/P  x  Z/P  in  which  planes 
defined  by  the  last  two  indices  are  partitioned  into  lines.  In  essence,  the  algorithm  can  be  thought 
of  as  N  times  the  RTA  on  data  sets:  Z/P  x  Z/P. 


Figure  2:  The  3D  index  set,  partitioned  into  lines  along  the  last  two  dimensions,  and  into  parallel 
planes  along  the  first  dimension. 


C.2.2  The  parallel  processing  strategy 

In the previous subsection we have shown how the DFT on a 3D index set can be partitioned into
independent computations. For each one of the P + 1 lines of a plane, P periodizations need to
be computed for a total of N · (P + 1) · P periodizations. Depending on the number of processors
(PEs) and the available memory per processor, different parallel implementations can be derived.
If P + 1 PEs are available, each one can be assigned to compute the DFT on one of the lines, and
there is no need for interprocessor communications. This scheme however requires that each PE
has access to the whole data set and is able to store at least the periodized data along a whole line
for all values of x (N · P elements).



Alternatively if P · (P + 1) processors are available, each one may be assigned to compute the
values of the DFT for one of the P points that belong to a particular line. For such an
implementation, the minimum memory requirement for a node is reduced to N, and more parallelism is
exploited at the expense of some inter-processor communications. The parallel processing strategy
is summarized below.

step 1: Compute in parallel the $N(P^2 + P)$ periodizations

$$g_l(x,c) = \sum_{b \in Q_l} f(x, c+b) =
\begin{cases}
\sum_{b=0}^{P-1} f(x, c - bl, b), & 0 \le l < P, \\
\sum_{b=0}^{P-1} f(x, b, c), & l = P.
\end{cases} \qquad (4)$$

• If P + 1 PEs are used:

$PE_l$, $l = 0, \ldots, P$, computes the $N \cdot P$ periodizations $\{g_l(x,c),\ 0 \le x < N,\ 0 \le c < P\}$. No
interprocessor communications are required.

• If $P^2 + P$ PEs are used:

$PE_{l,c}$, $l = 0, \ldots, P$, $c = 0, \ldots, P-1$, computes the $N$ periodizations $\{g_l(x,c),\ 0 \le x < N\}$.
Since the summation in (4) extends over $0 \le b \le P-1$, $PE_{l,c}$ needs to receive data residing in each
$PE_{l,\gamma}$, $\gamma \ne c$.

step 2: Compute the 1D N-point DFTs.

$$\hat g_l(u,c) = \sum_{x=0}^{N-1} g_l(x,c)\, e^{-2\pi i ux/N}$$

• If P + 1 PEs are used, then $PE_l$ computes the $P$ 1D N-point DFTs $\{\hat g_l(u,c),\ 0 \le c < P\}$.

• If $P^2 + P$ PEs are used, then $PE_{l,c}$ computes one 1D N-point DFT, namely $\hat g_l(u,c)$.

No interprocessor communications are required in either case.

step 3: Compute the P-point 1D DFTs.

$$\hat f(u,a) = \sum_{c \in Q_l^c} \hat g_l(u,c)\, \omega^{\langle a, c \rangle} \qquad (5)$$

• If P + 1 PEs are used, then $PE_l$ computes the $N$ 1D P-point DFTs $\{\hat f(u,a),\ 0 \le u < N\}$. No
interprocessor communications are required.

• If $P^2 + P$ PEs are used, then $PE_{l,c}$ computes one 1D P-point DFT, namely $\hat f(u,a)$. Since the
summation in (5) extends over $0 \le c \le P-1$, $PE_{l,c}$ needs to receive the partial result $\hat g_l(u,\gamma)$ from
each $PE_{l,\gamma}$, $\gamma \ne c$.
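
The three steps for the P + 1 processor case can be sketched as follows. This is only an illustration under stated assumptions: `line_task` and `dft1` are hypothetical names, `line_task(f, N, P, l)` stands for the work assigned to $PE_l$, the data are held in a nested list `f[x][y][z]`, and the 1D DFTs are evaluated directly instead of with an FFT kernel.

```python
import cmath

def dft1(v):
    """Direct 1D DFT of a list of complex numbers."""
    n = len(v)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(v[x] * w ** ((u * x) % n) for x in range(n)) for u in range(n)]

def line_task(f, N, P, l):
    """Work of PE_l: DFT values at the output points of line l.

    Returns a dict mapping (u, v, w) to the transform value at that point.
    """
    # step 1: the N*P periodizations g_l(x, c), as in equation (4)
    if l < P:
        g = [[sum(f[x][(c - b * l) % P][b] for b in range(P))
              for c in range(P)] for x in range(N)]
    else:
        g = [[sum(f[x][b][c] for b in range(P))
              for c in range(P)] for x in range(N)]
    # step 2: P one-dimensional N-point DFTs along x; cols[c][u]
    cols = [dft1([g[x][c] for x in range(N)]) for c in range(P)]
    # step 3: N one-dimensional P-point DFTs along c
    out = {}
    for u in range(N):
        row = dft1([cols[c][u] for c in range(P)])   # row[a]
        for a in range(P):
            key = (u, a, (a * l) % P) if l < P else (u, 0, a)
            out[key] = row[a]
    return out
```

Calling `line_task` for l = 0, ..., P and merging the returned dictionaries covers all N · P · P output points. No call reads another call's intermediate results, which mirrors the absence of interprocessor communication in this configuration.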

C.4  Extension  via  the  Chinese  Remainder  Theorem 

The  Chinese  Remainder  theorem  is  a  major  tool  in  algorithm  design.  It  is  the  basis  of  the  prime 
factor  algorithm  of  Good  and  Thomas  [11,  12].  It  can  be  stated  in  several  ways,  but  we  will  use  the 
theorem  as  a  statement  about  rings,  especially  the  idempotents,  for  uniformity  and  predictability 
in  implementation. 

Theorem 1 Chinese Remainder Theorem [13].

Let $N = N_1 N_2$, where the integers $N_1$ and $N_2$ are relatively prime. Then

$$Z/N \cong Z/N_1 \times Z/N_2.$$

Rather than proving the theorem, we state an explicit isomorphism and its inverse. The mapping
$\psi : Z/N \to Z/N_1 \times Z/N_2$ defined by:

$$\psi(n) = (n \bmod N_1,\ n \bmod N_2)$$

is an isomorphism. The inverse is defined in terms of the idempotents. Let $e_1$, $e_2$ be the elements
of $Z/N_1 N_2$ with

$$\psi(e_1) = (1,0), \quad \psi(e_2) = (0,1).$$

Then the mapping defined below is $\psi^{-1}$:

$$Z/N_1 \times Z/N_2 \to Z/N : (n_1, n_2) \mapsto (e_1 n_1 + e_2 n_2) \bmod N.$$

C.4.1  Good-Thomas  Prime  Factor  Algorithm  for  Z/MP 

Henceforth we will restrict to the case where $N_2$ is a prime number. Set $N_2 = P$ and $N_1 = M$.
The system of idempotents in this case will be given according to the residue of $M$ modulo $P$.

Theorem 2 Let $M \equiv c \bmod P$. Then $e_2 = c^{-1} M$, where $c^{-1}$ is the inverse of $c \in U(Z/P)$, the
multiplicative group of units of $Z/P$.

Proof: $M = c + mP$ for some $m \in Z$.

$$c^{-1} M = c^{-1}(c + mP) \equiv 1 \bmod P, \quad c^{-1} M \equiv 0 \bmod M.$$

Thus $\psi(c^{-1} M) = (0,1)$. Since $e_1 + e_2 \equiv 1 \bmod MP$, we have that $e_1 = MP + 1 - e_2$.

Example: 3 and 5 are relatively prime to each other. We will find the idempotents for the
isomorphism $Z/15 \cong Z/3 \times Z/5$. $3 \equiv 3 \bmod 5$. $3^{-1} = 2 \in Z/5$. Thus,

$$e_2 = 2 \cdot 3 = 6 \in Z/15, \quad e_1 = 15 + 1 - e_2 = 10 \in Z/15.$$

We also have the isomorphism $Z/15 \cong Z/5 \times Z/3$. $5 \equiv 2 \bmod 3$. $2^{-1} = 2 \in Z/3$. Thus,

$$e_2 = 2 \cdot 5 = 10 \in Z/15, \quad e_1 = 15 + 1 - e_2 = 6 \in Z/15.$$
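
The recipe of Theorem 2 is short enough to check mechanically. The sketch below is illustrative only: the function names `crt_idempotents` and `crt_inverse` are hypothetical, and the modular inverse is taken with Python's built-in three-argument `pow`, which accepts a negative exponent from Python 3.8 onward.

```python
def crt_idempotents(M, P):
    """Idempotents (e1, e2) for Z/MP ~ Z/M x Z/P, following Theorem 2.

    Assumes gcd(M, P) = 1; pow raises ValueError otherwise.
    """
    c = M % P
    c_inv = pow(c, -1, P)              # inverse of c in U(Z/P)
    e2 = (c_inv * M) % (M * P)         # psi(e2) = (0, 1)
    e1 = (M * P + 1 - e2) % (M * P)    # psi(e1) = (1, 0)
    return e1, e2

def crt_inverse(n1, n2, M, P):
    """Map (n1, n2) in Z/M x Z/P back to Z/MP using the idempotents."""
    e1, e2 = crt_idempotents(M, P)
    return (e1 * n1 + e2 * n2) % (M * P)
```

With these definitions, `crt_idempotents(3, 5)` and `crt_idempotents(5, 3)` reproduce the two pairs of idempotents found in the example, and `crt_inverse(n % 3, n % 5, 3, 5)` recovers every n in Z/15.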


Indexing Z/MP by the CRT, the DFT on Z/MP is computed by $F(P) \otimes F(M)$, where $F(L)$
denotes the $L$-point DFT matrix and $\otimes$ denotes the tensor product of matrices. Many formulations
of the Prime Factor Algorithm (PFA) exist [14, 15, 16, 17], but the explicit use of idempotents to
arrive at the tensor product decomposition can be found in [1, 18]. We will formulate the PFA
for two factors directly here since the derivation is easy and understanding the role of idempotents
has a direct impact on parallel implementation.

To derive the tensor product decomposition in the language of matrices, we will begin by
describing two distinct orderings of the group Z/MP. Let $\{e_1, e_2\}$ be the idempotents for the
isomorphism $Z/MP \cong Z/M \times Z/P$. The following presentations for the elements of Z/MP are
unique.

$$x \in Z/MP, \quad x = m e_1 + a e_2, \quad 0 \le m < M, \ 0 \le a < P. \qquad (6)$$

$$y \in Z/MP, \quad y = \mu P + \alpha M, \quad 0 \le \mu < M, \ 0 \le \alpha < P. \qquad (7)$$

• Order Z/MP antilexicographically by the pair $(m, a)$ obtained by the presentation of the
elements of Z/MP given in (6). We will use this to order the input data.

• Order Z/MP antilexicographically by the pair $(\mu, \alpha)$ obtained by the presentation of the
elements of Z/MP given in (7). We will use this to order the output of the Fourier transform
computation.


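$$\hat f(\mu P + \alpha M) = \sum_{a=0}^{P-1} \sum_{m=0}^{M-1} f(m e_1 + a e_2)\, e^{-\frac{2\pi i}{MP} (\mu P + \alpha M)(m e_1 + a e_2)}.$$

Recall that $e_1 \equiv 1 \bmod M$ and $e_2 \equiv 1 \bmod P$. Since

$$(\mu P + \alpha M)(m e_1 + a e_2) \equiv \mu m P + \alpha a M \bmod MP,$$

we have that

$$\hat f(\mu P + \alpha M) = \sum_{a=0}^{P-1} \sum_{m=0}^{M-1} f(m e_1 + a e_2)\, e^{-2\pi i \alpha a/P}\, e^{-2\pi i \mu m/M}. \qquad (8)$$

For a function $f$ defined on Z/MP, denote by $f$ the vector of values $f(x)$ ordered by (6).
Denote by $\hat f$ the vector of the Fourier transform of $f$ ordered by (7). We can express (8) in terms
of matrices as follows.

$$\hat f = [F(P) \otimes F(M)]\, f.$$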
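
A small numerical check of (8) may help fix the two orderings. The sketch below is an illustration under stated assumptions rather than an optimized PFA: `pfa_dft` and `dft` are hypothetical names, the short DFTs are evaluated directly, and the idempotents are obtained as in Theorem 2.

```python
import cmath

def dft(v):
    """Direct DFT of a list of complex numbers."""
    n = len(v)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(v[x] * w ** ((u * x) % n) for x in range(n)) for u in range(n)]

def pfa_dft(f, M, P):
    """DFT on Z/MP, gcd(M, P) = 1, via the two-factor prime factor algorithm."""
    N = M * P
    e2 = (pow(M % P, -1, P) * M) % N   # idempotent with psi(e2) = (0, 1)
    e1 = (N + 1 - e2) % N              # idempotent with psi(e1) = (1, 0)
    # Input ordering (6): g[a][m] = f(m*e1 + a*e2).
    g = [[f[(m * e1 + a * e2) % N] for m in range(M)] for a in range(P)]
    # M-point DFTs along m: g[a][mu].
    g = [dft(row) for row in g]
    F = [0j] * N
    for mu in range(M):
        # P-point DFT along a: col[alpha].
        col = dft([g[a][mu] for a in range(P)])
        for alpha in range(P):
            # Output ordering (7): the value belongs at index mu*P + alpha*M.
            F[(mu * P + alpha * M) % N] = col[alpha]
    return F
```

For example, `pfa_dft(f, 4, 3)` on a length-12 list should agree with `dft(f)` up to rounding. No twiddle factors appear between the two stages, which is the practical consequence of indexing by the idempotents.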


C.4.2  The  Hybrid  Good-Thomas  and  RTA  Algorithm  on  A  =  Z/N  x  Z/MP  x  Z/KP


Let $A = Z/N \times Z/MP \times Z/KP$, for natural numbers $N$, $M$, $K$ and a prime number $P$ such
that $GCD(M,P) = GCD(K,P) = 1$. By applying the CRT twice, we have the isomorphism

$$A \cong Z/N \times Z/M \times Z/P \times Z/K \times Z/P.$$

For $f \in L(A)$ and $(u, v, w) \in A$, the Fourier transform, $\hat f$, is defined by

$$\hat f(u,v,w) = \sum_{z=0}^{KP-1} \sum_{y=0}^{MP-1} \sum_{x=0}^{N-1} f(x,y,z)\, e^{-2\pi i ux/N}\, e^{-2\pi i vy/MP}\, e^{-2\pi i wz/KP}.$$

Set

$$g(n, m, k, a, b) = f(n,\ e_1 m + e_2 a,\ f_1 k + f_2 b) \qquad (9)$$

where $\{e_1, e_2\}$ is the system of idempotents for the isomorphism $Z/MP \cong Z/M \times Z/P$ and $\{f_1, f_2\}$
is the system of idempotents for the isomorphism $Z/KP \cong Z/K \times Z/P$. We can compute $\hat f$ by
computing $\hat g$ since

$$\hat f(u,\ \mu P + \alpha M,\ \kappa P + \beta K) = \hat g(u, \mu, \kappa, \alpha, \beta).$$


In the previous section, we described an algorithm for the case of an index set $A = Z/N \times Z/P \times Z/P$.
The same ideas can be applied to the index set $A \cong Z/N \times Z/M \times Z/P \times Z/K \times Z/P$. If $N$, $M$
and $K$ are powers of 2, the RTA algorithm can be used to decompose the data set into independent
computations that can be performed on each of the $P + 1$ (or $P^2 + P$) processors. The algorithm
remains essentially the same, with the 1D N-point DFT kernel now replaced by the 3D $N \times M \times K$
DFT kernel. The additional data re-indexing defined by equation (9) can be incorporated into the
computation of the periodizations with respect to the sets $Z/P \times Z/P$ during the first step. It
is interesting to note that with the application of the CRT, the resulting hybrid algorithm now
computes the DFT on sets of lines that are parallel to the lines of Figure 1 as shown in Figure 3
for the case $P = 3$, $M = K = 2$.


Figure 3: The output decimating lines for the case P = 3, M = K = 2.

C.4.3  The  parallel  hybrid  algorithm  using  P  +  1  processors

The parallel algorithm for the computation of the 3D Fourier Transform of a complex function
defined on the index set A = Z/N x Z/MP x Z/KP is given below:


Processor $l$ ($l = 0, \ldots, P$)

• step 1: Combined computation of Good-Thomas permutation and periodizations

for $c = 0 \ldots P-1$, $b = 0 \ldots P-1$,
  for $n = 0 \ldots N-1$, $m = 0 \ldots M-1$, $k = 0 \ldots K-1$,
    if ($l < P$) then
      $g_l(n,m,k,c) := g_l(n,m,k,c) + f(n,\ (e_1 m + e_2 (c - bl)_P)_{MP},\ (f_1 k + f_2 b)_{KP})$
    else
      $g_l(n,m,k,c) := g_l(n,m,k,c) + f(n,\ (e_1 m + e_2 b)_{MP},\ (f_1 k + f_2 c)_{KP})$

where we denote by $(\cdot)_L$ the modulo $L$ operation. Note that at this step every processor needs
to access the whole data set stored in the array $f(N, MP, KP)$ and at the end produces an
$N \times M \times K \times P$ array containing the periodized data with respect to the line $l$.

• step 2: Computation of P 3D FFTs of size $N \times M \times K$

for $c = 0 \ldots P-1$

$$\hat g_l(\hat n, \hat m, \hat k, c) = \sum_{k=0}^{K-1} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} g_l(n,m,k,c)\, e^{-2\pi i n \hat n/N}\, e^{-2\pi i m \hat m/M}\, e^{-2\pi i k \hat k/K}$$

• step 3: Computation of $(N \cdot M \cdot K)$ 1D P-point DFTs

for $\hat n = 0 \ldots N-1$, $\hat m = 0 \ldots M-1$, $\hat k = 0 \ldots K-1$,

$$\hat f(\hat n, \hat m, \hat k, \hat c) = \sum_{c=0}^{P-1} \hat g_l(\hat n, \hat m, \hat k, c)\, e^{-2\pi i c \hat c/P}, \quad 0 \le \hat c < P$$

Note that the same ideas can also be employed to compute the 2D DFT for a function defined on
the index set $A = Z/MP \times Z/KP$. The parallel hybrid algorithm for the case of P + 1 nodes is
given below:

Processor $l$ ($l = 0, \ldots, P$)

• step 1: Combined computation of Good-Thomas permutation and periodizations

for $c = 0 \ldots P-1$, $b = 0 \ldots P-1$,
  for $m = 0 \ldots M-1$, $k = 0 \ldots K-1$,
    if ($l < P$) then
      $g_l(m,k,c) := g_l(m,k,c) + f((e_1 m + e_2 (c - bl)_P)_{MP},\ (f_1 k + f_2 b)_{KP})$
    else
      $g_l(m,k,c) := g_l(m,k,c) + f((e_1 m + e_2 b)_{MP},\ (f_1 k + f_2 c)_{KP})$

The array $g_l(M, K, P)$ now contains the periodized data with respect to the line $l$.

• step 2: Computation of P 2D FFTs of size $M \times K$

for $c = 0 \ldots P-1$ compute

$$\hat g_l(\hat m, \hat k, c) = \sum_{k=0}^{K-1} \sum_{m=0}^{M-1} g_l(m,k,c)\, e^{-2\pi i m \hat m/M}\, e^{-2\pi i k \hat k/K}$$

• step 3: Computation of $(M \cdot K)$ P-point DFTs

for $\hat m = 0 \ldots M-1$, $\hat k = 0 \ldots K-1$, compute

$$\hat f(\hat m, \hat k, \hat c) = \sum_{c=0}^{P-1} \hat g_l(\hat m, \hat k, c)\, e^{-2\pi i c \hat c/P}, \quad 0 \le \hat c < P$$
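
The 2D variant of the per-processor procedure can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the Fortran and assembly code used in this work: `hrta_line`, `idempotents` and `dft` are hypothetical names, the FFT kernels are replaced by direct DFTs, and the mapping of each result to an output index $(\hat m P + \alpha M,\ \hat k P + \beta K)$, with $(\alpha, \beta) = \hat c\,(1, l)$ for $l < P$ and $(0, \hat c)$ for $l = P$, is our reading of (7) combined with the line definitions of section C.2.

```python
import cmath

def dft(v):
    """Direct DFT of a list of complex numbers."""
    n = len(v)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(v[x] * w ** ((u * x) % n) for x in range(n)) for u in range(n)]

def idempotents(M, P):
    """Idempotents (e1, e2) for Z/MP ~ Z/M x Z/P, gcd(M, P) = 1."""
    e2 = (pow(M % P, -1, P) * M) % (M * P)
    return (M * P + 1 - e2) % (M * P), e2

def hrta_line(f, M, K, P, l):
    """Work of processor l for a 2D array f of size MP x KP.

    Returns a dict mapping output indices (v, w) to DFT values.
    """
    e1, e2 = idempotents(M, P)
    f1, f2 = idempotents(K, P)
    MP, KP = M * P, K * P
    # step 1: Good-Thomas permutation combined with periodization; g[c][m][k]
    g = [[[0j] * K for _ in range(M)] for _ in range(P)]
    for c in range(P):
        for b in range(P):
            a_in, b_in = ((c - b * l) % P, b) if l < P else (b, c)
            for m in range(M):
                for k in range(K):
                    g[c][m][k] += f[(e1 * m + e2 * a_in) % MP][(f1 * k + f2 * b_in) % KP]
    # step 2: P two-dimensional M x K DFTs; gh[c][kh][mh]
    gh = []
    for c in range(P):
        rows = [dft(g[c][m]) for m in range(M)]
        gh.append([dft([rows[m][kh] for m in range(M)]) for kh in range(K)])
    # step 3: M*K one-dimensional P-point DFTs along c
    out = {}
    for mh in range(M):
        for kh in range(K):
            line = dft([gh[c][kh][mh] for c in range(P)])
            for ch in range(P):
                alpha, beta = (ch, (ch * l) % P) if l < P else (0, ch)
                out[((mh * P + alpha * M) % MP, (kh * P + beta * K) % KP)] = line[ch]
    return out
```

Merging the dictionaries returned for l = 0, ..., P yields all MP · KP output values. Each call reads the whole input array but writes only to its own line, which matches the data-access pattern described for step 1.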

C.5  Implementation  issues 

C.5.1  The  Intel  iPSC/860  Hypercube 

The Intel iPSC/860 parallel processing system is a distributed memory, Multiple Instruction
Multiple Data (MIMD) hypercube, containing up to 128 = 2^7 compute nodes (processing elements,
PEs) based on the Intel i860 high performance 64-bit RISC microprocessor. The i860 has a peak
performance  of  80  MFlops  and  is  equipped  with  8K  data  and  4K  instruction  cache  memory.  Each 
node  has  8  to  64  Mbytes  of  external  local  memory,  a  network  interface  and  a  message  router.  The 
router  can  handle  up  to  8  bidirectional  communication  channels,  seven  of  which  may  be  connected 
to  neighboring  nodes  and  one  is  dedicated  to  external  I/O  and  is  directly  connected  to  the  host 
processor. 

The PEs are connected to each other via relatively slow full duplex asynchronous communication
channels that can carry messages of variable length. The channel bandwidth is about
2.8 Mbytes/sec. The wormhole routing technique, which minimizes the delay between receiving a
message in a node and retransmitting it to its final destination, is used. The message passing can
be either synchronous or asynchronous. Synchronous message passing blocks the execution of
the node programs until the communication has been completed, whereas asynchronous message
passing returns immediately and is useful if the node processors can perform other computations
while waiting for the communication to complete. The system is equipped with a Concurrent File
System (CFS) [19] that distributes files across all available disks in blocks, such that different
compute nodes can access different parts of a file without creating a bottleneck at a particular I/O
node.

C.5.2  Initial  data  loading  and  distribution 

The  hybrid  Reduced  Transform  Algorithm  (HRTA)  that  we  propose  requires  that  the  whole  data 
set is accessible from all the nodes, so that all periodizations with respect to the line assigned to
a  PE  can  be  computed.  This  does  not  necessarily  mean  that  every  node  has  to  store  the  whole 
data  set,  although  the  latter  could  be  helpful  in  certain  environments.  The  traditional  row-column 
(RC)  algorithm  on  the  other  hand  requires  each  node  to  have  access  to  only  a  subset  of  the  rows 
or  the  columns  of  the  original  data  array,  but  a  severe  communications  overhead  is  introduced  by 
the  need  to  perform  one  or  more  global  transpositions  of  the  data. 

Data  entry  to  the  multi-processor  machine  depends  on  the  particular  application  in  which  the 
DFT  is  embedded.  While  in  some  applications  the  data  are  stored  in  the  disk(s)  and  have  to  be 
imported  to  the  nodes  of  the  parallel  machine,  in  other  applications  the  data  have  already  been 
imported  during  previous  computational  stages  or  have  been  generated  locally  in  the  nodes.  Since 
the  initial  data  loading  is  application  dependent,  we  have  not  investigated  the  implementation 
of  this  step  in  detail.  We  have  however  considered  two  different  models  for  the  initial  stage  of 
the HRTA: In the first model, which is referred to as the master-slaves method, a master node
computes all the periodizations and sends to the other nodes (the slaves) only the periodized data.
Using this method the need for storage on the nodes is reduced since each node has to store only
the periodized data. Furthermore the computation of the periodizations by the master node can
be  performed  in  a  way  that  interleaves  computation  and  communication  steps  in  order  to  achieve 
optimum  performance. 

In the second model, also referred to as the multi-processor model, all nodes have access to
the whole data set, so that in an initial loading phase, either all nodes access a shared file system
concurrently, or one node reads the data from a file and then broadcasts them to all the other
nodes. Although the HRTA requires that larger (than for the RC method) data sets be sent to the
nodes, the fact that these data sets are the same allows for the use of the broadcasting capabilities
of the parallel machine. This approach is especially attractive for shared bus based machines where
broadcasting can be performed efficiently. On the iPSC/860 hypercube broadcasting the whole
data set is faster than sending "chunks" of different data elements to each processor, but certainly
much slower than a one-hop away communication step. After extensive experimentation with the
iPSC/860 we concluded that the master-slaves model allows for more efficient implementations of
the HRTA than the multi-processor model.

Since, for the hybrid algorithm, the node computations are completely independent, there
is very little need for synchronization among the nodes. Therefore completely asynchronous
implementations that exploit the MIMD nature of the machine and allow each node to perform its
computations as soon as the data are received are possible. On the other hand, in the row-column
method, which is the most commonly used method today, a series of distributed global data
transpositions has to be performed, and its efficiency is highly dependent on tight synchronization
among the processors. Therefore, the increased communication that the hybrid algorithm needs
during the loading phase does not make it slower than the row-column method, unless more
sophisticated methods for data distribution can be employed. (We intend to explore this issue in
detail by investigating the capabilities, advantages and drawbacks of the CFS that the iPSC/860
supports.)

C.5.3  Reporting  the  results  to  the  host 

The final phase of reporting results, as well as the initial phase of loading data, depends on the DFT
application. In some applications upon completion of the DFT the results need not be reported
back to the host since they are further processed. In others, it is desired to store all the DFT values
in the external disk memory. In the parallel HRTA we propose, the distribution of the results on
the nodes is according to the lines they belong to. Whereas in some applications it is desired to
organize the results in the same order as the original data, in others it is essential to return the
results along subsets of the original index space (lines or planes) [10], [9]. Since the final reporting
phase is highly dependent on the application, we have not investigated this issue in detail. We
would like to mention however that the limited synchronization needs of the HRTA lead to flexible
implementations of the final reporting phase, because the nodes can finish their computations
independently and start returning their results asynchronously as soon as they become available.

C.6  Implementation  Results 

It has been a common belief among the signal processing community that with the pipelining
and dual operations capabilities of the modern RISC microprocessors, there is no need for DFT
algorithms for data sizes that are not a power of two. This was so because zero padding can be
employed along with the highly optimized, microprocessor specific, power-of-two FFT routines.
As we will show here, this is not true for multi-dimensional DFTs. Zero padding along many
dimensions can increase the data size tremendously and reduce the efficiency of the power-of-two
routines drastically. Moreover, in a multiprocessor environment, the standard power-of-two
Row-Column (RC) based FFT algorithms require one or more global transposition steps in which all
processors need to communicate with every other processor in the network. Due to the limited
bandwidth of the communication links, the global transposition steps result in a bottleneck that
severely limits the maximum achievable speedup.
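
To make the cost of zero padding concrete, the short sketch below computes how much larger a data set becomes when every dimension is padded to the next power of two. It is illustrative only: `next_pow2` and `padding_factor` are hypothetical helper names, and padding each dimension independently is an assumption about how the power-of-two routines would be applied.

```python
def next_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def padding_factor(shape):
    """Ratio of zero-padded element count to original element count."""
    padded, original = 1, 1
    for n in shape:
        padded *= next_pow2(n)
        original *= n
    return padded / original
```

For a size of the form 3 · 2^k each padded dimension contributes a factor of 4/3, so `padding_factor((192, 192))` is 16/9 and a data set padded along three such dimensions grows by 64/27.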

C.6.1  The  2D  DFT  case,  MP  x  KP 

To  demonstrate  the  advantages  of  the  proposed  hybrid  RTA  algorithm  (HRTA)  relative  to  the 
traditional  row-column  (RC)  power-of-two  algorithm,  we  compare  an  implementation  for  the  2D 
DFT  case  with  a  highly  optimized  Intel  iPSC/860,  vendor  supplied  RC  implementation  for  the  case 
P  =  3,  using  L  =  P  +  1  =  4  nodes.  The  HRTA  periodization  step  was  coded  in  Fortran,  whereas 
for  the  2D  FFTs  we  used  vendor  supplied,  assembly  coded,  power-of-two  FFT  routines,  optimized 
for  the  i860  processor.  Finally,  for  the  3-point  DFTs  step  we  also  used  optimized,  hand  coded  in 
assembly,  vectorized  routines.  We  performed  several  tests  for  various  non-power-of-two  data  sizes 
and we report the computational time achieved by both methods. The time is measured from
the point that all the necessary data already reside in the nodes, and until the results have been
computed and stored in the processors' local memory. In both implementations the distribution
of the results is different from the original data distribution. Using the HRTA the results are
distributed along the lines assigned to each processor, and using the RC method the results are
distributed in a transposed fashion.

In Table 1 we compare the speed of the two algorithms for a variety of data sizes. Depending
on the amount of zero padding, the RC method could be up to about 70% slower than the HRTA.
Moreover, our HRTA implementation can be further optimized (assembly coding of the periodization
step), whereas the RC implementation is already fully optimized for the Intel iPSC/860. As
we can see from Table 1, the speedup over the RC method increases with the data size as expected,
since the amount of zero-padding increases with the size of the original non-power-of-two data set
as well.

As an indication of the percentage of time spent on each one of the three major computational
tasks we refer to the case: $M = 256$, $K = 256$, $P = 3$ (size 768 x 768). The times (in msec) for the
computation of $MKP = 3 \cdot 2^{16}$ periodizations, $P = 3$ $M \times K = 256 \times 256$ 2D FFTs and $MK = 2^{16}$
3-point DFTs respectively are: $t_p = 475.0208$, $t_{fft2d} = 542.0287$ and $t_{dft3p} = 97.6183$. As we can
see, the time required for the periodizations almost equals that for the 2D FFTs. A careful assembly
coding of the periodizations step is expected to reduce this time by at least 50%, thus making an
optimized HRTA implementation twice as fast as the optimized RC implementation.


    Hybrid Algorithm               Row-Column Method
    MP x KP       time (msec)      size            time (msec)
    192 x 192        66.7470       256 x 256          95.9750
    192 x 384       130.8893       256 x 512         194.5686
    384 x 384       254.7608       512 x 512         403.3229
    384 x 768       511.9606       512 x 1024        866.7061
    768 x 768      1117.2697       1024 x 1024      1876.8777

Table 1: Comparison of the performance of the HRTA parallel algorithm vs. the iPSC/860 optimized
RC parallel algorithm implementation, for various data sizes, and P = 3. In both methods
the data are assumed to initially reside in the nodes.
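
The relative slowdown of the RC method can be read off Table 1 with a few lines of arithmetic. The sketch below only restates the table entries; the dictionary name `table1` and the choice of the RC/HRTA time ratio as the summary figure are ours.

```python
# (HRTA time, RC time) in msec, copied from Table 1 and keyed by the HRTA data size.
table1 = {
    "192 x 192": (66.7470, 95.9750),
    "192 x 384": (130.8893, 194.5686),
    "384 x 384": (254.7608, 403.3229),
    "384 x 768": (511.9606, 866.7061),
    "768 x 768": (1117.2697, 1876.8777),
}

# Ratio of RC time to HRTA time for each data size.
ratios = {size: rc / hrta for size, (hrta, rc) in table1.items()}

for size, r in ratios.items():
    print(f"{size}: RC takes {r:.2f} times as long as the HRTA")
```

The largest ratio occurs for the 384 x 768 case and is just under 1.70, which is the basis for the "up to about 70% slower" figure quoted in the text.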

C.6.2  The  3D  DFT  case,  N  x  MP  x  KP 

We have implemented the parallel HRTA algorithm for the case P = 3, on L = P + 1 = 4 nodes,
where  N,  M  and  K  are  assumed  to  be  a  power  of  two.  In  the  3D  case,  the  periodization  step 
can  be  organized  to  result  in  a  much  more  regular  memory  access  than  in  the  2D  case,  since  now 
vector  additions  of  data  stored  in  consecutive  memory  locations  can  be  employed.  We  coded  this 
step  using  a  mixture  of  Fortran  and  vector  addition  assembly  routines,  whereas  assembly  routines 
have  been  used  for  both  the  3D  FFTs  and  the  3-point  DFTs. 

In Table 2 we compare the HRTA with the optimized iPSC/860 implementation of the RC
algorithm  for  a  variety  of  data  sizes.  Using  both  methods  the  data  initially  reside  in  the  nodes, 
and  the  time  is  measured  up  to  the  point  that  the  results  have  been  computed  and  stored  in  the 
local  memory.  As  we  can  see  from  Table  2,  the  RC  algorithm  is  on  the  average  about  70%  slower 
than  the  HRTA  algorithm  for  a  good  mix  of  the  cases  tested.  In  the  same  table  we  also  report  the 
computational  times  required  for  the  DFT  of  the  same  data  set  using  the  RC  method  on  8  nodes. 
It  is  interesting  to  observe  that  even  if  the  number  of  nodes  is  doubled  the  performance  is  increased 
by  only  15%  on  the  average  relative  to  the  4-node  HRTA  implementation. 

In Figure 4 we compare our implementation of the parallel HRTA with the parallel RC method
by plotting the (base 2) logarithm of the computational times required by both methods for data
sizes N x 96 x 96, versus log N. In the same figure we plot the ratio of the computational times
("speedup") as well. As we can see the RC method can be as much as 1.70 times slower than the
HRTA for the range of N examined.

As  an  indication  of  the  percentage  of  computational  time  spent  in  each  stage  of  the  HRTA,  we 
report  the  times  (in  msec)  required  for  the  major  tasks  involved  when  the  data  size  is  16  x  192  x  192 
(i.e.  A  =  16,  P  =  3,  M  =  K  =  64).  In  this  case  we  need  to  compute:  NMKP  =  3-2*® 
Hybrid Algorithm                        Row-Column Method
N x MP x KP      time (msec, 4 PEs)     size             time1 (msec, 4 PEs)   time2 (msec, 8 PEs)
8 x 96 x 96        183.6536             8 x 128 x 128      289.6318              150.3773
8 x 96 x 192       351.6283             8 x 128 x 256      592.9389              298.6226
8 x 192 x 192      719.3023             8 x 256 x 256     1278.7431              628.8135
8 x 192 x 384     1522.1512             8 x 256 x 512     2568.7333             1262.5482
16 x 96 x 96       338.0456             16 x 128 x 128     565.8212              273.1891
16 x 96 x 192      690.1413             16 x 128 x 256    1134.1762              556.7175
16 x 192 x 192    1422.3148             16 x 256 x 256    2245.3022             1175.6518
4 x 384 x 384     1791.1266             4 x 512 x 512     2941.7306                 -

Table  2:  Performance  comparison  of  the  3D  HRTA  parallel  algorithm  vs.  the  iPSC/860  optimized 
RC  parallel  implementation,  for  a  variety  of  data  sizes  and  P  =  3.  In  both  methods  the  data 
are  assumed  to  initially  reside  in  the  nodes.  For  the  RC  method  we  report  both  the  4-nodes  and 
8-nodes  time. 

periodizations, P = 3 3D FFTs of size $N \times M \times K = 16 \times 64 \times 64$, and $NMK = 2^{16}$ 3-point DFTs. The
corresponding times are: $t_p = 302.2875$, $t_{fft3d} = 1022.5539$ and $t_{dft3p} = 97.7224$. It is interesting
to notice that although the number of periodizations is the same ($3 \cdot 2^{16}$) as for the 2D DFT case
(768 x 768) discussed in the previous subsection, $t_p$ is reduced by about 35%. This is because in the
3D DFT case, accesses to the data array are more localized than in the 2D case since periodizations
are computed only along the two-dimensional planes. As we can see, the 3D FFTs computation is
still the most expensive task. A 3D FFT (16 x 64 x 64), although applied to a data set with the
same number of elements as in the 2D case (256 x 256), is two times slower than a 2D FFT. This
is mainly due to the fact that the 3D FFT requires more function calls to the optimized 1D FFT
routine as well as additional transposition steps. The assembly coded 3-point DFT is again as fast
as in the 2D case. The large percentage of the computational time that the 3D FFT requires leads
us to believe that trying to limit the need for large 3D FFTs is more important than optimizing
the periodizations.
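The relative weight of the three stages follows from the quoted times. The sketch below recomputes the operation counts and the percentage share of each stage for the 16 x 192 x 192 case; the dictionary keys and function name are illustrative, and the only data used are the three times given in the text.

```python
# Stage times in msec for the 16 x 192 x 192 case, as quoted in the text.
STAGES = {
    "periodizations": 302.2875,
    "3D FFTs": 1022.5539,
    "3-point DFTs": 97.7224,
}

# Problem parameters for this case: data size N x MP x KP with P = 3.
N, M, K, P = 16, 64, 64, 3

# Operation counts stated in the text.
num_periodizations = N * M * K * P   # equals 3 * 2**16
num_point_dfts = N * M * K           # equals 2**16


def stage_shares(stages):
    """Return each stage's share of the summed stage time, in percent."""
    total = sum(stages.values())
    return {name: 100.0 * t / total for name, t in stages.items()}


if __name__ == "__main__":
    for name, share in stage_shares(STAGES).items():
        print(f"{name:15s} {share:5.1f} %")
```

The three stage times add up to roughly the total reported for this size in Table 2, which supports reading them as a breakdown of that entry.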

In  Table  3  we  report  execution  times  that  include  the  initial  data  loading  phase.  In  both 
implementations  the  data  are  assumed  to  initially  reside  in  one  node  which  then  distributes  them 
to all the others. For the hybrid method we used the master-slaves model, described in section 4,
that  works  as  follows:  The  master  PE  performs  all  the  periodizations;  as  soon  as  one  periodization  is 
completed,  the  results  are  sent  via  non-blocking  communications  to  a  slave  PE  and  the  computation 
of  the  next  periodization  can  start  in  the  master  PE.  The  slave  node  that  receives  the  periodized 
data  can  proceed  with  the  3D  FFTs.  This  interleaving  between  computations  and  communications 
Figure 4: Performance comparison of the 4-node 3D HRTA parallel algorithm vs. the 3D RCA
method. Left: plots of the (base 2) logarithm of the computational time (in milliseconds) versus log N.
For the HRTA the data sizes used were of the form N x 96 x 96 and for the RC method the
corresponding sizes were zero padded to N x 128 x 128. Right: the ratio of times $T_{rc}/T_{hrta}$ (speedup).

achieves  optimum  performance  using  the  HRTA.  In  the  row-column  method  each  one  of  the  four  PEs 
needs only 1/4 of the data set. Including the data loading phase leads to even larger improvements
over  the  RC  method.  This  is  due  to  the  asynchronous  nature  of  the  hybrid  method  implementation 
that  allows  data  loading  in  a  pipelined  fashion  to  further  reduce  the  total  DFT  time. 
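The master-slaves interleaving described above can be imitated on a single machine with threads and queues. This is only a rough sketch under stated assumptions: `periodize` and `transform` are hypothetical stand-ins for one periodization and for the 3D FFT stage, and a `queue.Queue` plays the role of a non-blocking send. It is not the iPSC/860 message-passing code used in the report.

```python
import queue
import threading


def periodize(block_id):
    """Hypothetical stand-in for one periodization performed by the master."""
    return [float(block_id + i) for i in range(8)]


def transform(data):
    """Hypothetical stand-in for the 3D FFT stage performed by a slave."""
    return sum(data)


def run_pipeline(num_blocks, num_workers):
    """Master periodizes block after block; workers transform as data arrive."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = {}
    lock = threading.Lock()

    def worker(inbox):
        while True:
            item = inbox.get()          # blocks until the master has sent data
            if item is None:            # sentinel: no more work for this worker
                break
            block_id, data = item
            value = transform(data)
            with lock:
                results[block_id] = value

    threads = [threading.Thread(target=worker, args=(q,)) for q in inboxes]
    for t in threads:
        t.start()

    for block_id in range(num_blocks):
        data = periodize(block_id)
        # put() returns immediately, so the master can start the next
        # periodization while the worker processes this block.
        inboxes[block_id % num_workers].put((block_id, data))

    for q in inboxes:
        q.put(None)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(num_blocks=6, num_workers=3))
```

The point of the sketch is the ordering of events: each send returns before the receiver has finished its work, so periodization of the next block overlaps with the transform of the previous one.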

The final reporting of the results to the master node can also be done in pipelined fashion.
The nodes do not all finish their computations at the same time. The master node finishes first; it
can then re-shuffle its own data back into order and then receive messages from the other nodes.
As soon as each node finishes its computation, it can return its part of the results to the master
node. On the other hand, in the RC method all nodes finish almost simultaneously and the total
reporting time will be the sum of the times required by each individual node to return its results
to the master node. As we can see from Table 3 (column labeled time2), when the final reporting
stage is included the advantage of the HRTA becomes even greater.

C.6.3  The  hybrid  algorithm  implementation  for  larger  sizes  of  P 

In this subsection we present preliminary results on the performance of the HRTA implementations
for 3D DFTs of sizes N x MP x KP, where the prime number is P = 5 or P = 7. In Tables 4 and
5 we report execution times on six and eight nodes respectively.

In  Figure  5  we  plot  the  computational  time  versus  the  size  of  the  problem  as  well  as  the 
Hybrid Algorithm (4 nodes)                      Row-Column Method (4 nodes)
N x MP x KP      time1 (msec)   time2 (msec)    size             time1 (msec)         time2 (msec)
64 x 48 x 6        121.89         146.97        64 x 64 x 8       148.27 (+21.64%)     229.92 (+56.44%)
64 x 96 x 6        210.28         290.53        64 x 128 x 8      291.01 (+38.39%)     451.78 (+55.50%)
128 x 96 x 6       477.11         573.18        128 x 128 x 8     587.26 (+23.08%)     905.37 (+57.95%)
128 x 192 x 6      940.17        1136.35        128 x 256 x 8    1214.48 (+29.17%)    1882.26 (+65.64%)
16 x 96 x 96       791.75        1007.59        16 x 128 x 128   1139.49 (+43.92%)    1764.86 (+75.16%)

Table 3: Comparison of the performance of the 3D HRTA vs. the RC method. The data initially
reside in one master node; time1 includes the data distribution whereas time2 includes in addition
the final reporting to the master node. For the RC method, the percentages in parentheses
correspond to $100 \cdot (T_{rc} - T_{hrta})/T_{hrta}$.


Hybrid Algorithm (6 nodes)           Row-Column Method (8 nodes)
N x MP x KP       time (msec)        size              time (msec)
8 x 160 x 320      745.4254          8 x 256 x 512     1262.5699
8 x 160 x 160      388.6131          8 x 256 x 256      628.7040
16 x 80 x 160      395.3691          16 x 128 x 256     556.5085
32 x 80 x 80       392.0486          32 x 128 x 128     587.5301
64 x 40 x 80       395.1672          64 x 64 x 128      577.1927
128 x 40 x 40      414.7549          128 x 64 x 64      578.8189
2048 x 10 x 10     856.1560          2048 x 16 x 16     714.3222

Table  4:  Comparison  of  the  hybrid  algorithm  implementation  and  the  row-column  method,  for 
P  =  5. 

speedup ratio over the RC method applied to a data set zero padded up to the next power of two
in each dimension. As we can see from Figure 5, although the optimized RC algorithm runs on
8 nodes, instead of 6 for the HRTA algorithm, it is about 1.5 times slower than the non-optimized
HRTA implementation.
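The cost of zero padding up to the next power of two can be quantified by comparing element counts before and after padding. The helper names below are illustrative and not part of the report's code.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    p = 1
    while p < n:
        p *= 2
    return p


def padded_shape(shape):
    """Shape after padding every dimension up to the next power of two."""
    return tuple(next_power_of_two(d) for d in shape)


def padding_overhead(shape):
    """Ratio of padded element count to original element count."""
    original = 1
    padded = 1
    for d in shape:
        original *= d
        padded *= next_power_of_two(d)
    return padded / original


if __name__ == "__main__":
    for shape in [(16, 96, 96), (16, 40, 40), (16, 56, 56)]:
        print(shape, "->", padded_shape(shape), f"x{padding_overhead(shape):.2f}")
```

For the N x 40 x 40 sizes used with P = 5, the RC method operates on N x 64 x 64 arrays, that is, on 2.56 times as many elements as the HRTA.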

C.6.4  The  node  clustering  approach 

As  we  have  seen  earlier  in  the  four-node  3D  DFT  case,  each  node  needs  to  perform  three  3D  FFTs 
of  size  M  X  K  X  N.  If  the  size  of  the  3D  FFT  data  is  too  large  to  fit  into  a  single  node,  or  faster 
implementations  are  desired,  the  four  nodes  can  now  be  considered  as  four  conceptual  clusters  of 
nodes.  In  each  of  the  clusters,  the  3D  data  is  distributed  along  the  first  dimension,  and  both  the 
periodization  and  3-point  DFT  steps  can  be  performed  independently  by  each  node  of  a  cluster. 

Hybrid Algorithm (8 nodes)           Row-Column Method (8 nodes)
N x MP x KP       time (msec)        size              time (msec)
8 x 224 x 224      647.1830          8 x 256 x 256      628.7040
16 x 112 x 224     595.7953          16 x 128 x 256     556.5085
32 x 112 x 112     584.6240          32 x 128 x 128     587.5301
64 x 56 x 112      567.5866          64 x 64 x 128      577.1927
128 x 56 x 56      577.8302          128 x 64 x 64      578.8189
2048 x 14 x 14    1109.0271          2048 x 16 x 16     714.3222

Table  5:  Comparison  of  the  hybrid  algorithm  implementation  and  the  row-column  method,  for 
P  =  7. 

For the 3D FFT computation, only communication among the processors of the same cluster is
needed, thus greatly reducing the total communication requirements. In Figure 6 we show how an
eight-node hypercube is organized in 4 clusters to compute the 3D DFT of size N x 3M x 3K.
Node clustering can be used to create scalable implementations that make full utilization of
the available hardware. If the number of nodes is $2^n$ and four clusters are used, every node needs
to store only $1/2^{n-2}$ of the original data set. In Tables 6, 7 and 8 we present timing results for
8-, 16- and 32-node 4-cluster implementations (P = 3) and compare the performance of the HRTA
with that of the highly optimized row-column method using zero-padding up to the next power of
two in every dimension. It is again assumed that the data already reside in the nodes before the
processing starts. Moreover, the three 3D FFTs computed by every cluster are implemented using
the optimized row-column routines.
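The storage requirement per node follows from the cluster layout. The sketch below assumes, as the text describes, that every cluster holds a full copy of the data set, split evenly along the first dimension among the nodes of that cluster; the function name is illustrative.

```python
from fractions import Fraction


def cluster_layout(num_nodes, num_clusters):
    """Return (nodes per cluster, fraction of the data set stored per node).

    Assumes each cluster holds the whole data set, distributed evenly
    among its nodes along the first dimension.
    """
    if num_nodes % num_clusters != 0:
        raise ValueError("nodes must divide evenly into clusters")
    nodes_per_cluster = num_nodes // num_clusters
    return nodes_per_cluster, Fraction(1, nodes_per_cluster)


if __name__ == "__main__":
    # Configurations reported in Tables 6 to 9: (nodes, clusters).
    for nodes, clusters in [(8, 4), (16, 4), (32, 4), (12, 6)]:
        per_cluster, share = cluster_layout(nodes, clusters)
        print(f"{nodes} nodes, {clusters} clusters: "
              f"{per_cluster} PEs/cluster, each node stores {share} of the data")
```

With four clusters and $2^n$ nodes this gives $2^{n-2}$ nodes per cluster, so each node stores $1/2^{n-2}$ of the data.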


Hybrid Algorithm                     Row-Column Method
N x MP x KP       time (msec)        size              time (msec)
16 x 192 x 192    1027.9498          16 x 256 x 256    1175.2196
8 x 384 x 192     1052.3800          8 x 512 x 256     1330.3198
8 x 384 x 384     2206.7041          8 x 512 x 512     2669.4727

Table  6:  HRTA  in  8-nodes  =  4  clusters  of  2  PEs/cluster  vs.  8-nodes  RC  with  zero  padding 

In Figure 7, we plot the computational time required by each implementation, versus log N,
for data sizes of the form N x 96 x 96. In the same figure we also show the ratio $T_{rc}/T_{hrta}$, as before.
As we can see from Figure 7, the hybrid algorithm is only slightly better than the row-column
method. However, the periodization part of our code is just a standard Fortran implementation
and it can be further optimized.

In Table 7 we present timing results for a 16-node implementation; P + 1 = 4 clusters (with
Figure 5: Comparison of the 6-node HRTA to the 8-node RC implementation. Left: plots of the
(base 2) logarithm of the computational time (in msec) vs. log2(N). The data sizes used with the
HRTA were of the form N x 40 x 40, P = 5. The corresponding RCA data sizes were of the form
N x 64 x 64. Right: the speedup ratio $T_{rc}/T_{hrta}$ vs. log2(N).

4 PEs/cluster) cooperate to perform the 3D DFT. Each of the clusters has the whole data set
stored  in  it.  Within  a  cluster,  each  of  the  nodes  stores  1/4  of  the  data  (distributed  along  the 
first  dimension).  Three  four-node  row-column  3D  FFTs  are  performed  within  each  cluster.  In 
Figure  8  we  show  the  data  distribution  within  one  of  the  clusters.  As  we  can  see  from  Table 
7,  the  non-optimized  HRTA  implementation  has  comparable  performance  with  the  optimized  RC 
implementation.  The  computational  time  versus  the  size  of  the  data  set  and  the  speedup  ratio  over 
the  row-column  method  is  shown  in  Figure  9.  Finally,  in  Table  8,  we  present  timing  results  for  a 
32-node  HRTA  implementation,  partitioned  into  4  clusters  with  8  PEs  each. 

The node clustering approach can be used in general for any size of the prime number P. As an




Hybrid Algorithm                     Row-Column Method
N x MP x KP       time (msec)        size              time (msec)
16 x 192 x 192     544.8489          16 x 256 x 256     569.4243
16 x 384 x 192    1103.3147          16 x 512 x 256    1204.5177
16 x 384 x 384    2289.6839          16 x 512 x 512    2411.0360
8 x 384 x 768     2373.4150          8 x 512 x 1024        -

Table  7:  HRTA  in  16-nodes  =  4  clusters  of  4  PEs/cluster  vs.  16-nodes  RC  with  zero  padding. 
Figure 6: An 8-node hypercube, organized into P + 1 = 4 clusters of 2 nodes each. Only
near-neighbor communications inside every cluster are needed to compute an N x MP x KP 3D FFT.

example, we have implemented the case N x 5M x 5K (P = 5) in a 12-node configuration. The 12
nodes are partitioned into Q = P + 1 clusters, each one having two nodes. The data are distributed
evenly within each cluster along the first dimension, and the row-column 3D FFT kernel is used
to perform the 3D DFTs inside every cluster. Only communication among PEs in the same cluster is
needed. In Table 9 we compare the HRTA implementation with the RC method running
on 16 nodes. As we can see, the 12-node HRTA outperforms the 16-node RC implementation.


Hybrid Algorithm                     Row-Column Method
N x MP x KP       time (msec)        size              time (msec)
128 x 96 x 96      532.9717          128 x 128 x 128    589.2343
64 x 192 x 192    1039.3567          64 x 256 x 256    1173.0707
32 x 384 x 192    1071.3325          32 x 512 x 256    1216.7244
32 x 384 x 384    2208.9439          32 x 512 x 512    2444.6264

Table  8:  HRTA  in  32-nodes  =  4  clusters  of  8  PEs/cluster  vs.  32-nodes  RC  with  zero  padding. 
Figure 7: Performance comparison of the 8-node HRTA (4 clusters with 2 PEs/cluster) vs. the
RC method. Left: Plots of the (base 2) logarithm of the computational time (in milliseconds) vs.
log N. Right: the speedup ratio $T_{rc}/T_{hrta}$.


Hybrid Algorithm (12 nodes)          Row-Column Method (16 nodes)
N x MP x KP       time (msec)        size              time (msec)
16 x 80 x 80       137.7877          16 x 128 x 128     139.8977
16 x 160 x 80      260.3311          16 x 256 x 128     279.9925
16 x 160 x 160     512.0763          16 x 256 x 256     567.8107
16 x 320 x 160     979.6029          16 x 512 x 256    1201.9575

Table 9: Comparison between a 12-node HRTA parallel algorithm with clustering (6 clusters, 2
PEs/cluster), and the RC method running on 16 nodes. The HRTA is faster although it uses 25%
fewer nodes.

C.6.5  Conclusions  and  further  Research  directions 

A new approach for computing multi-dimensional DFTs with limited interprocessor communications
has been proposed, and its advantages relative to the standard row-column power-of-two
based FFT algorithms have been demonstrated. Although it has been a common belief that with
the  available  modern  RISC  microprocessors  there  is  no  need  for  new  “exotic”  DFT  algorithms,  we 
have  shown  that  substantial  computational  savings  can  be  achieved  in  a  parallel  environment  by 
using  a  more  flexible  hybrid  scheme.  The  DFT  is  a  major  component  of  numerous  signal  and  image 
processing  applications  and  if  real-time  operation  is  envisioned,  only  parallel  processing  can  satisfy 
the  user  demands. 
Figure 8: A 4-node cluster, part of a 16-node, 4-cluster hypercube: Only communications inside
each cluster need to be performed to compute a 3D FFT.

The proposed hybrid algorithm combines the advantages of both the recently proposed RTA
and the Cooley-Tukey RC method to give optimal parallel realizations for non-power-of-two data
sizes. We demonstrated the flexibility and the efficiency of the HRTA by implementing it on an
Intel iPSC/860 hypercube, where our non-optimized HRTA realizations outperform the highly
optimized RC method realizations. The HRTA provides an alternative that is suitable for many
different parallel and distributed processing environments. In a DSP board with 4 compute nodes
communicating via a shared bus, the HRTA seems to be the only viable parallel processing scheme
that would achieve real-time performance. In Clusters Of Workstations (COWS), a rapidly emerging
cost-effective model for parallel computing, the need for an all-to-all communication, which is
necessary for transposition using the RC method, would render the RC method highly inefficient.
On the other hand, the HRTA has little or no need for communication between different workstations,
so that very fast implementations can be created.

We have demonstrated that the HRTA shares the scalability properties of the RC algorithm
so that multi-dimensional DFTs of large data sizes can be computed efficiently on parallel
architectures. To optimize the HRTA the periodization step can be further improved. The modulo
arithmetic based addressing can be avoided if more local memory is allocated to store two integer
arrays Ind1(M,P) and Ind2(K,P) used as index-lookup tables. Their entries can be either computed
once or preloaded along with the data. An alternative approach is to replace the modulo operations
with additions and conditional statements. Moreover, since along the index n the periodizations
reduce essentially to vector additions, efficient assembly language modules that make full use of the
pipelining capabilities of the i860's RISC architecture can be employed. Therefore, the larger the
N the better the use of the CPU pipelining capabilities and of the cache memory. Along the other
Figure 9: Performance comparison of the 16-node (4 clusters with 4 PEs/cluster) HRTA vs. the
16-node RC method. Left: Plots of the (base 2) logarithm of the computational time (in msec)
vs. log N. The data sizes used are of the form N x 96 x 96, zero padded to N x 128 x 128 for the
RC method. Right: the speedup ratio $T_{rc}/T_{hrta}$.

indices,  the  memory  addresses  to  be  referenced  do  not  follow  a  sequential  pattern,  so  that  extra 
care  must  be  taken  to  prefetch  the  necessary  data  before  they  are  needed. 
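Both ways of avoiding modulo operations mentioned above can be illustrated in a few lines. This is a generic sketch only: the report does not give the contents of Ind1(M,P) and Ind2(K,P), so the table rule used here, entry (m, p) = (m + p * rows) mod modulus, is a hypothetical example and not the report's actual indexing.

```python
def indices_with_modulo(start, step, count, modulus):
    """Reference version: one modulo operation per generated index."""
    return [(start + k * step) % modulus for k in range(count)]


def indices_incremental(start, step, count, modulus):
    """Same sequence using only an addition and a conditional per index.

    Requires 0 <= start < modulus and 0 <= step < modulus, so that a single
    conditional subtraction is enough to wrap around.
    """
    if not (0 <= start < modulus and 0 <= step < modulus):
        raise ValueError("start and step must lie in [0, modulus)")
    out = []
    index = start
    for _ in range(count):
        out.append(index)
        index += step
        if index >= modulus:
            index -= modulus
    return out


def build_lookup_table(rows, cols, modulus):
    """Precompute a rows x cols index table once, to be reused in inner loops.

    Hypothetical entry rule: table[m][p] = (m + p * rows) % modulus.
    """
    return [
        indices_incremental(m % modulus, rows % modulus, cols, modulus)
        for m in range(rows)
    ]


if __name__ == "__main__":
    print(indices_with_modulo(5, 3, 10, 7))
    print(indices_incremental(5, 3, 10, 7))
    print(build_lookup_table(4, 3, 12))
```

The lookup table trades memory for arithmetic, while the incremental version trades the modulo for a branch; which one is preferable depends on the cache and pipeline behavior of the target processor.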

Also note that there is a lot of flexibility in how the nested loops can be arranged in order
to compute the periodized data $g_i$ from the original data set $f$. This flexibility can also be used
to minimize cache misses. For a large prime number P, the ordering of the nested loops that
sequentially addresses the elements of the larger data array $f$ seems more advantageous. This is
because the ratio of sizes of the two arrays $f$ and $g_i$ increases with P, and the whole $g_i$ matrix can
most probably fit into the cache. Therefore accessing the data array $f$ sequentially reduces
the cache misses, since the data $f$ can be imported into the cache in a column by column fashion
and then transformed to the periodized data.

The execution of the P multi-dimensional FFTs could become faster by either grouping together
or interleaving the tasks involved. Recall that each one of the 2D or 3D parallel FFTs using
clustering consists of three tasks: (i) 2D or 1D FFTs, (ii) communication (global transposition)
and (iii) 1D FFTs. Therefore the following two optimizations are possible: (1) Group all the
corresponding tasks of the P multi-dimensional FFTs together and do the same with the communication
stages, so that only one communication startup time is needed instead of P. (2) Employ a
vector-pipelined parallel 3D FFT: Using asynchronous communication calls, computations associated with
the next FFT can be interleaved with communications required for the previous FFT. Since the
communication time accounts for as much as 50% of the overall time, pipelining strategies are
expected to greatly improve the performance.
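The benefit of optimization (1) can be seen with a simple linear cost model, in which a message costs a fixed startup time plus a per-byte transfer time. The model and all parameter values below are assumptions made for illustration; they are not measurements from the report.

```python
def transfer_time(num_messages, total_bytes, startup, per_byte):
    """Linear cost model: every message pays one startup, every byte a fixed cost."""
    return num_messages * startup + total_bytes * per_byte


def separate_vs_grouped(p, bytes_per_fft, startup, per_byte):
    """Communication time for P separate transpositions vs. one grouped one.

    The same total volume is sent in both cases; only the number of
    startups differs (P versus 1).
    """
    total_bytes = p * bytes_per_fft
    separate = transfer_time(p, total_bytes, startup, per_byte)
    grouped = transfer_time(1, total_bytes, startup, per_byte)
    return separate, grouped


if __name__ == "__main__":
    # Hypothetical parameters: 3 FFTs, 1000 bytes each,
    # 0.1 msec startup, 0.001 msec per byte.
    separate, grouped = separate_vs_grouped(3, 1000, 0.1, 0.001)
    print(f"separate: {separate:.3f} msec, grouped: {grouped:.3f} msec")
```

Under this model the saving is exactly (P - 1) startups, so grouping matters most when messages are short and startup time dominates.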

When each one of the nodes has completed the computation of a set of multi-dimensional FFTs,
the  independent  P-point  DFTs  have  to  be  performed.  One  method  is  to  compute  the  P-point 
DFTs  after  all  the  previous  computations  associated  with  the  data  set  have  been  completed.  It 
has  the  advantage  of  allowing  the  use  of  highly  optimized  vectorized  assembly  routines.  Another 
method  is  to  use  the  partial  contribution  of  each  sub-block  of  data  after  the  partial  periodization 
and  2D  (3D)  FFT  is  being  computed  and  as  soon  as  these  data  become  available.  Note  that  as 
soon  as  a  2D  (3D)  FFT  for  a  part  of  the  periodized  data,  corresponding  to  a  point  on  the  line, 
has  been  computed  the  contribution  of  that  buffer  to  the  overall  P-point  DFT  can  be  computed. 
Although  this  implementation  has  the  drawback  of  requiring  a  larger  amount  of  computations 
relative  to  the  first  method,  it  has  the  advantage  of  more  efficient  balancing  between  computations 
and  communications  since  these  two  phases  can  be  interleaved. 
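The second method relies on the fact that a P-point DFT is linear in its inputs, so each input can be folded into all P outputs as soon as it becomes available and in any order. The sketch below shows this for scalar inputs; in the HRTA each input would be a whole transformed buffer. The sign convention $e^{-2\pi i jk/P}$ and the class name are assumptions made for illustration.

```python
import cmath


def dft(x):
    """Direct P-point DFT, computed after all inputs are available."""
    p = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / p) for j in range(p))
        for k in range(p)
    ]


class IncrementalDFT:
    """Accumulates a P-point DFT one input at a time, in any arrival order."""

    def __init__(self, p):
        self.p = p
        self.acc = [0j] * p

    def add(self, j, value):
        """Fold input number j into all P outputs (P multiply-adds)."""
        for k in range(self.p):
            self.acc[k] += value * cmath.exp(-2j * cmath.pi * j * k / self.p)


if __name__ == "__main__":
    x = [1 + 2j, -0.5j, 3.0]
    inc = IncrementalDFT(3)
    for j in (2, 0, 1):          # inputs arrive out of order
        inc.add(j, x[j])
    print(inc.acc)
    print(dft(x))
```

Each arriving input costs P multiply-adds, which is the extra computation mentioned in the text, but no input has to wait for the others.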

The  extension  of  the  HRTA  algorithm  to  include  periodizations  with  respect  to  sets  of  higher 
dimensionality  (planes  instead  of  lines)  is  worth  investigating.  It  is  expected  to  lead  to  completely 
asynchronous  implementations  that  take  full  advantage  of  the  large  number  of  processors  available 
and  increase  the  achievable  efficiency  for  problems  of  very  large  size. 
D Weyl-Heisenberg Systems and the Finite Zak Transform


Abstract 

Previously,  a  theoretical  foundation  for  designing  algorithms  for  computing  Weyl-Heisenberg 
coefficients  at  critical  sampling  was  established  applying  the  finite  Zak  transform.  This  theory 
established  clear  and  easily  computable  conditions  for  existence  of  Weyl-Heisenberg  expansion 
and  for  stability  of  computations.  The  main  computational  task  in  the  resulting  algorithm  was 
a  2-dimensional  finite  Fourier  transform. 

In  this  work  we  extend  the  applicability  of  the  approach  to  rationally  oversampled  Weyl- 
Heisenberg  systems  by  developing  a  deeper  understanding  of  the  relationship  established  by  the 
finite  Zak  transform  between  linear  algebra  properties  of  Weyl-Heisenberg  systems  and  function 
theory  in  Zak  space.  This  relationship  will  impact  on  questions  of  existence,  parameterization 
and  computation  of  Weyl-Heisenberg  expansions. 

Implementation results on a single i860 RISC processor and on the PARAGON parallel
multiprocessor system are given. The algorithms described in this paper possess highly parallel
structure and are especially suited to a distributed memory parallel processing environment.
Timing results show that real-time computation of W-H expansions is realizable.


D.1 Introduction

During  the  last  four  years  powerful  new  methods  have  been  introduced  for  analyzing  Wigner 
transforms  of  discrete  and  periodic  signals  [7,  8,  10]  based  on  finite  Weyl-Heisenberg  (W-H) 
expansions  [1,  4,  5,  9].  A  recent  work  [7]  adapted  these  methods  to  gain  control  over  the 
cross-term  interference  problem  [6]  by  constructing  signal  systems  in  time  frequency  space  for 
expanding  Wigner  transforms  from  W-H  systems  based  on  Gaussian-like  signals. 

The  computational  feasibility  of  the  method  in  [7]  depends  strongly  on  the  availability  of 
efficient  and  stable  algorithms  for  computing  W-H  expansion  coefficients.  Since  in  general, 
W-H  systems  are  not  orthogonal,  standard  Hilbert  space  inner  product  methods  do  not  apply. 
Moreover  since  critically  sampled  W-H  systems  may  not  form  a  basis,  oversampling  in  time- 
frequency  is  necessary  for  the  existence  of  arbitrary  signal  expansions.  In  fact  this  is  usually  the 
case  for  systems  based  on  the  Gaussian.  In  [7,  8,  9,  10,  11],  the  concept  of  biorthogonals  was 
applied  to  the  problem  of  W-H  coefficient  computation.  In  [11],  the  Zak  transform  provided 
the framework for computing biorthogonals for rationally oversampled W-H systems forming
frames. A similar approach for critically and integer oversampled W-H systems can be found
in [2, 3]. The goal in this work is somewhat different in that major emphasis is placed on
describing linear spans of W-H systems which are not necessarily complete and on establishing,
in a form suitable for RISC and parallel processing, algorithms for computing W-H coefficients
of signals in such linear spans. For the most part our approach extends that developed in
[2], and frame theory, an important part of [11], plays no role in this work. However, as in these
previous works, the finite Zak transform will be established as a fundamental and powerful
tool for studying critically sampled and rationally oversampled W-H systems and for designing
algorithms for computing W-H coefficients for discrete and periodic signals. The role of the
finite  Zak  transform  is  analogous  to  that  played  by  the  Fourier  transform  in  replacing  complex 
convolution  computations  by  simple  pointwise  multiplication.  In  this  new  setting  properties 
of  W-H  systems  such  as  their  spanning  space  and  dimension  can  be  determined  by  simple 
operations  on  functions  in  Zak  space.  This  relationship  will  impact  on  questions  of  existence, 
parameterization  and  computation  of  W-H  expansions. 

In the oversampled case both integer and rational oversampling are investigated. Implementation
results on a single i860 RISC processor and on the PARAGON parallel multiprocessor
system are given for sample sizes both of powers of 2 and mixed sizes with factors 2, 3, 4, 5, 6, 7,
8,  9.  The  algorithms  described  in  this  paper  possess  highly  parallel  structure  and  are  especially 
suited  in  a  distributed  memory  parallel  processing  environment.  Timing  results  on  single  i860 
processor  and  on  4-  and  8-node  computing  systems  show  that  real-time  computation  of  W-H 
expansions  is  realizable. 

In section 2, the basic preliminaries will be established. Algorithms will be described in
section 3 for critically sampled W-H systems, in section 4 for integer oversampled systems and
in section 5 for rationally oversampled systems. Implementation results will be given in sections
6, 7 and 8.

D.2 Preliminaries

D.2.1  Weyl-Heisenberg  systems 

Choose an integer $N > 0$. A discrete function $f(a)$, $a \in \mathbb{Z}$, is called $N$-periodic if

$$f(a + N) = f(a), \quad a \in \mathbb{Z}.$$

Denote by $L(N)$ the Hilbert space of all $N$-periodic functions with inner product

$$\langle f, g \rangle = \sum_{a=0}^{N-1} f(a)\, g^*(a), \quad f, g \in L(N).$$

For $0 \le m, n < N$ and $g \in L(N)$ define $g_{m,n} \in L(N)$ by

$$g_{m,n}(a) = g(a + m)\, e^{-2\pi i na/N}, \quad a \in \mathbb{Z}. \tag{1}$$

Suppose $N = KM = K'M'$. The Weyl-Heisenberg (W-H) system $(g, M', K)$ is the set of
functions

$$\{ g_{m'M',\, n'K} : 0 \le m' < K',\ 0 \le n' < M \}. \tag{2}$$

We distinguish three cases

critically sampled $K = K'$, $M = M'$,
oversampled $K' > K$, $M' < M$,
undersampled $K' < K$, $M' > M$.

We further distinguish two classes of oversampled W-H systems.

Integer oversampled $M = RM'$, $R \in \mathbb{Z}$,
Rational oversampled $M = RM'$, $R \in \mathbb{Q}$, $R \notin \mathbb{Z}$.

An expansion of $f \in L(N)$ over a W-H system is called a W-H expansion.

D.2.2  Finite  Zak  transform  (FZT) 

Suppose $N = KM$. For $f \in L(N)$ define the finite Zak transform (FZT), $Z(K)f(a, b)$,
$a, b \in \mathbb{Z}$, by

$$Z(K)f(a, b) = \sum_{r=0}^{K-1} f(a + rM)\, e^{2\pi i br/K}, \quad a, b \in \mathbb{Z}. \tag{3}$$

Elementary  properties  of  FZT  including  FZT  based  algorithms  for  computing  W-H  expansions 
over  complete  critically  sampled  W-H  systems  can  be  found  in  [2].  We  will  briefly  discuss  these 
results  without  proof  and  extend  the  role  of  the  FZT  to  general  W-H  systems. 
Theorem 1 If $f \in L(N)$ then

$$Z(K)f(a + M, b) = e^{-2\pi i b/K}\, Z(K)f(a, b), \quad a, b \in \mathbb{Z}. \tag{4}$$

$$Z(K)f(a, b + K) = Z(K)f(a, b), \quad a, b \in \mathbb{Z}. \tag{5}$$

Theorem 1 implies $Z(K)f$ is $N$-periodic in each variable and is completely determined by its
values

$$Z(K)f(a, b), \quad 0 \le a < M,\ 0 \le b < K. \tag{6}$$
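The definition of the FZT and the periodicity relations of Theorem 1 can be checked numerically on a small example. The sketch below assumes the sign convention $Z(K)f(a,b) = \sum_r f(a + rM)\, e^{2\pi i br/K}$; with the opposite sign the phase factor in the first relation is conjugated. The function name `zak` is illustrative.

```python
import cmath


def zak(f, K, a, b):
    """Finite Zak transform Z(K)f(a, b) of an N-periodic signal f, N = K*M.

    f is given by its values f[0], ..., f[N-1]; indices are taken modulo N,
    so a and b may be any integers. Assumed kernel: exp(+2*pi*i*b*r/K).
    """
    N = len(f)
    if N % K != 0:
        raise ValueError("K must divide the period N")
    M = N // K
    return sum(
        f[(a + r * M) % N] * cmath.exp(2j * cmath.pi * b * r / K)
        for r in range(K)
    )


if __name__ == "__main__":
    K, M = 3, 4
    f = [complex(i * i % 7, i) for i in range(K * M)]
    a, b = 1, 2
    # Quasi-periodicity in a and periodicity in b (Theorem 1).
    print(zak(f, K, a + M, b), cmath.exp(-2j * cmath.pi * b / K) * zak(f, K, a, b))
    print(zak(f, K, a, b + K), zak(f, K, a, b))
```

Because of these two relations, only the M x K values with $0 \le a < M$ and $0 \le b < K$ need to be computed and stored.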

Denote by $L(M, K)$ the Hilbert space of all functions $F(a, b)$, $0 \le a < M$, $0 \le b < K$, with
inner product

$$\langle F, G \rangle = \sum_{a=0}^{M-1} \sum_{b=0}^{K-1} F(a, b)\, G^*(a, b), \quad F, G \in L(M, K). \tag{7}$$

Define $Z_0(K)f \in L(M, K)$ by

$$Z_0(K)f(a, b) = Z(K)f(a, b), \quad 0 \le a < M,\ 0 \le b < K. \tag{8}$$

In  [2]  we  find  the  following  theorem. 

Theorem 2 The mapping $K^{-1/2} Z_0(K)$ is an isometry from $L(N)$ onto $L(M, K)$. If
$F \in L(M, K)$ and $f \in L(N)$ is defined by

$$f(a + Mr) = K^{-1} \sum_{b=0}^{K-1} F(a, b)\, e^{-2\pi i br/K}, \quad 0 \le a < M,\ 0 \le r < K, \tag{9}$$

then $F = Z_0(K)f$.
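Theorem 2 can be verified numerically: the inversion formula recovers the signal, and the squared norm of the Zak table is K times the squared norm of the signal. The sketch assumes the forward kernel $e^{2\pi i br/K}$, so that the inverse uses $e^{-2\pi i br/K}$; the function names are illustrative.

```python
import cmath


def zak_table(f, K):
    """Return F[a][b] = Z0(K)f(a, b) for 0 <= a < M, 0 <= b < K, with N = K*M."""
    N = len(f)
    if N % K != 0:
        raise ValueError("K must divide the period N")
    M = N // K
    return [
        [
            sum(f[a + r * M] * cmath.exp(2j * cmath.pi * b * r / K) for r in range(K))
            for b in range(K)
        ]
        for a in range(M)
    ]


def inverse_zak(F):
    """Recover f from F: f(a + M r) = (1/K) sum_b F(a, b) exp(-2 pi i b r / K)."""
    M, K = len(F), len(F[0])
    f = [0j] * (M * K)
    for a in range(M):
        for r in range(K):
            f[a + M * r] = sum(
                F[a][b] * cmath.exp(-2j * cmath.pi * b * r / K) for b in range(K)
            ) / K
    return f


def norm_sq_1d(f):
    """Squared norm of a signal in L(N)."""
    return sum(abs(v) ** 2 for v in f)


def norm_sq_2d(F):
    """Squared norm of a table in L(M, K)."""
    return sum(abs(v) ** 2 for row in F for v in row)


if __name__ == "__main__":
    K = 3
    f = [complex(i % 5, -i) for i in range(12)]
    F = zak_table(f, K)
    print(max(abs(u - v) for u, v in zip(f, inverse_zak(F))))
    print(norm_sq_2d(F), K * norm_sq_1d(f))
```

For each fixed a the transform is a K-point DFT along r, which is why the whole table costs M FFTs of length K.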

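For $f \in L(N)$ and $F = Z_0(K)f$, we can summarize the preceding discussion by the matrix
formula

$$\begin{bmatrix} F(0,0) & F(1,0) & \cdots & F(M-1,0) \\ F(0,1) & F(1,1) & \cdots & F(M-1,1) \\ \vdots & \vdots & & \vdots \\ F(0,K-1) & F(1,K-1) & \cdots & F(M-1,K-1) \end{bmatrix} = F(K) \begin{bmatrix} f(0) & f(1) & \cdots & f(M-1) \\ f(M) & f(M+1) & \cdots & f(2M-1) \\ \vdots & \vdots & & \vdots \\ f((K-1)M) & f((K-1)M+1) & \cdots & f(N-1) \end{bmatrix}$$

where $F(K)$ is the $K$-point Fourier transform matrix

$$F(K) = \left[ w^{jk} \right]_{0 \le j, k < K}, \quad w = e^{2\pi i/K}.$$

Throughout this work we will identify $L(N)$ with $L(M, K)$ by theorem 2 and the matrix
formula. For the most part, including the computation of W-H expansions, once we are in
$L(M, K)$ we never need to formally return to $L(N)$.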

D.2.3  Basic  formulas 

The  following  two  theorems  are  proved  in  [2]. 

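Theorem 3 If $g \in L(N)$, $N = KM$, and $0 \le m, n < N$, then

$$Z(K)g_{m,n}(a, b) = e^{-2\pi i na/N}\, Z(K)g(a + m, b - n), \quad a, b \in \mathbb{Z}. \tag{10}$$

In particular, if $0 \le m' < K$, $0 \le n' < M$, then

$$Z(K)g_{m'M,\, n'K}(a, b) = e^{-2\pi i (an'/M + bm'/K)}\, Z(K)g(a, b), \quad a, b \in \mathbb{Z}. \tag{11}$$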
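The effect of a time shift and modulation on the FZT can be checked numerically. The sketch below assumes the conventions $g_{m,n}(a) = g(a+m)\, e^{-2\pi i na/N}$ and $Z(K)f(a,b) = \sum_r f(a+rM)\, e^{2\pi i br/K}$, under which the shift acts on the first Zak variable and the modulation on the second, up to a phase factor. Other sign conventions change the signs in the identity but not its structure.

```python
import cmath


def zak(f, K, a, b):
    """Finite Zak transform with kernel exp(+2*pi*i*b*r/K); indices taken mod N."""
    N = len(f)
    M = N // K
    return sum(
        f[(a + r * M) % N] * cmath.exp(2j * cmath.pi * b * r / K)
        for r in range(K)
    )


def shift_modulate(g, m, n):
    """Return g_{m,n}(a) = g(a + m) * exp(-2*pi*i*n*a/N) for 0 <= a < N."""
    N = len(g)
    return [g[(a + m) % N] * cmath.exp(-2j * cmath.pi * n * a / N) for a in range(N)]


def theorem3_gap(g, K, m, n, a, b):
    """Absolute difference between the two sides of the shift-modulation identity."""
    N = len(g)
    lhs = zak(shift_modulate(g, m, n), K, a, b)
    rhs = cmath.exp(-2j * cmath.pi * n * a / N) * zak(g, K, a + m, b - n)
    return abs(lhs - rhs)


if __name__ == "__main__":
    K = 3
    g = [complex(1 + a, a * a % 5) for a in range(12)]
    print(theorem3_gap(g, K, m=5, n=7, a=2, b=1))
```

For shifts by multiples of M and modulations by multiples of K, the right-hand side reduces to a pure phase times $Z(K)g(a,b)$, which is what makes the critically sampled case diagonal in Zak space.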

By theorem 1, the product function

$$Z(K)f(a, b)\, Z^*(K)g(a, b), \quad a, b \in \mathbb{Z},\ f, g \in L(N), \tag{12}$$

is $M$-periodic in the variable $a$ and $K$-periodic in the variable $b$ and can be viewed as a function
in $L(M, K)$. The Fourier expansion of the product function is given in the following theorem.

Theorem  4  For  f,g£  L{K),  N  =  KM, 

1  K-\  M-\ 

Z(K)f{a,  b)Z'{K)g{a,  6)  =  ^  E  E  <  /■  (13) 

D.3  Critically  Sampled  W-H  Systems. 

Theorem  4  is  a  powerful  tool  for  analyzing  W-H  systems.  We  first  consider  critically  sampled 
W-H  systems  by  extending  the  following  result  [2]. 

Theorem 5 The critically sampled W-H system

$$(g, M, K) = \{ g_{m'M,n'K} : 0 \le m' < K,\ 0 \le n' < M \} \tag{14}$$

is a basis of $L(N)$ if and only if $G = Z_0(K)g$ never vanishes.

By theorem 4 and the linear isomorphism established in theorem 2, we can identify the space of all $f \in L(N)$ satisfying

$$\langle f, g_{m'M,n'K} \rangle = 0, \quad 0 \le m' < K,\ 0 \le n' < M, \tag{15}$$

with the space of all $F \in L(M,K)$ satisfying

$$FG \equiv 0, \quad G = Z_0(K)g. \tag{16}$$

The  space  of  such  F  6  L{M,  K)  can  be  identified  with  the  orthogonal  complement  of  the  linear 
span  of  {g,  M,  K).  If  G  never  vanishes  this  complement  is  {0}  and  {g,  M,  K)  is  a  basis  of  L{N) 
which  is  the  content  of  theorem  5.  More  generally,  we  have  the  following  result. 

Theorem 6 If the zero set $\zeta$ of $G = Z_0(K)g$ has exactly $J$ points then the dimension of the linear span of $(g,M,K)$ is $N - J$. A function $f \in L(N)$ is in the linear span of $(g,M,K)$ if and only if $F = Z_0(K)f$ vanishes on $\zeta$.

If $F$ vanishes on $\zeta$, then we can write

$$F = GP, \quad P \in L(M,K).$$

In this case

$$f = \sum_{m'=0}^{K-1} \sum_{n'=0}^{M-1} c(m'M, n'K)\, g_{m'M,n'K} \tag{17}$$

if and only if

$$P(a,b) = \sum_{m'=0}^{K-1} \sum_{n'=0}^{M-1} c(m'M, n'K)\, e^{-2\pi i (an'/M + bm'/K)}. \tag{18}$$

The W-H expansion coefficients of $f$ over $(g,M,K)$ are given by the 2D $M \times K$ FT of $P$. If $G$ never vanishes then $P$ is uniquely determined and the mapping

$$P \to F = GP \to f, \quad F = Z_0(K)f,$$

defines a linear isomorphism from $L(M,K)$ onto $L(N)$.
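The critically sampled procedure (Zak transform, pointwise quotient, 2D FT) can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: it assumes $Z(K)f(a,b) = \sum_r f(a+rM)\, e^{2\pi i br/K}$ and $g_{m,n}(x) = g(x+m)\, e^{-2\pi i nx/N}$, and it realizes the "2D FT of $P$" as NumPy's normalized inverse transform `ifft2`, which inverts the double sum expressing $P$ through the coefficients. All function names are hypothetical.

```python
import numpy as np

def zak(f, M, K):
    """Assumed convention: F[a, b] = sum_r f[a + r*M] * exp(2j*pi*b*r/K)."""
    X = np.asarray(f, dtype=complex).reshape(K, M)   # X[r, a] = f[a + r*M]
    return (K * np.fft.ifft(X, axis=0)).T

def wh_coefficients(f, g, M, K, tol=1e-12):
    """Coefficients C[n', m'] = c(m'M, n'K) of f over (g, M, K), G never vanishing."""
    F, G = zak(f, M, K), zak(g, M, K)
    if np.any(np.abs(G) < tol):
        raise ValueError("Zak transform of g has zeros; the expansion is not unique")
    P = F / G                                        # F = G P pointwise
    # P[a, b] = sum_{n', m'} C[n', m'] * exp(-2j*pi*(a*n'/M + b*m'/K)), so invert it.
    return np.fft.ifft2(P)

def wh_synthesis(C, g, M, K):
    """Rebuild f = sum c(m'M, n'K) g_{m'M, n'K} with the assumed shift convention."""
    N = M * K
    x = np.arange(N)
    g = np.asarray(g, dtype=complex)
    f = np.zeros(N, dtype=complex)
    for n in range(M):
        for m in range(K):
            shifted = np.roll(g, -m * M)             # shifted[x] = g[(x + m*M) mod N]
            f += C[n, m] * shifted * np.exp(-2j * np.pi * n * x / M)
    return f
```

For a generic window `g` whose Zak transform has no zeros, `wh_synthesis(wh_coefficients(f, g, M, K), g, M, K)` reproduces `f` up to rounding, which is a convenient consistency check on the chosen sign conventions.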

Suppose that the zero set $\zeta$ of $G$ has exactly $J$ points with $J > 0$. Then $(g,M,K)$ is linearly dependent and does not span $L(N)$. Choose $f \in L(N)$ in the linear span of $(g,M,K)$. For each function

$$\alpha : \zeta \to \mathbb{C}$$

define $P = P^{\alpha} \in L(M,K)$ by

$$P(a,b) = \begin{cases} F(a,b)/G(a,b), & (a,b) \notin \zeta, \\ \alpha(a,b), & (a,b) \in \zeta. \end{cases} \tag{19}$$

The space of such $P$ is a $J$-dimensional subspace of $L(M,K)$. Since $F$ vanishes on $\zeta$, $F = GP$, leading to the next result.

Theorem 7 If the zero set $\zeta$ of $G = Z_0(K)g$ has exactly $J$ points then every $f$ in the linear span of $(g,M,K)$ has a $J$-dimensional space of W-H expansions over $(g,M,K)$. The coefficient space of W-H expansions of $f$ over $(g,M,K)$ is given by the set of all 2D $M \times K$ FT of the $J$-dimensional space of functions $P^{\alpha} \in L(M,K)$.


D.4  Integer  Oversampled  W-H  Systems 

Suppose $N = MK = M'K'$ with $M = RM'$, $R \in \mathbb{Z}$. The integer oversampled W-H system $\mathbf{g} = (g, M', K)$ is the disjoint union of critically sampled W-H systems,

$$\mathbf{g} = \bigcup_{r=0}^{R-1} \mathbf{g}_r, \quad \mathbf{g}_r = (g_r, M, K), \quad g_r = g_{rM',0}, \quad 0 \le r < R. \tag{20}$$

It is just as simple to consider the more general case where $\mathbf{g}$ is the disjoint union of critically sampled W-H systems $\mathbf{g}_r = (g_r, M, K)$, $g_r \in L(N)$, $0 \le r < R$. Denote the zero set of $G_r = Z_0(K)g_r$ by $\zeta_r$ and set $\zeta = \bigcap_{r=0}^{R-1} \zeta_r$. Arguing as in the preceding section, $f \in L(N)$ is in the linear span of $\mathbf{g}$ if and only if $F = Z_0(K)f$ can be written in the form

$$F = \sum_{r=0}^{R-1} F_r, \quad F_r = G_r P_r, \quad P_r \in L(M,K). \tag{21}$$

In fact, if

$$f = \sum_{r=0}^{R-1} \sum_{m'=0}^{K-1} \sum_{n'=0}^{M-1} c_r(m'M, n'K)\, (g_r)_{m'M,n'K},$$

then we can take

$$P_r(a,b) = \sum_{m'=0}^{K-1} \sum_{n'=0}^{M-1} c_r(m'M, n'K)\, e^{-2\pi i (an'/M + bm'/K)}.$$

As a consequence, if $f$ is in the linear span of $\mathbf{g}$ then $F$ vanishes on $\zeta$.

Conversely, suppose $F$ vanishes on $\zeta$. The following construction defines the simplest decomposition of $F$ of the form (21). Define $\psi_r \in L(M,K)$, $0 \le r < R$, by

$$\psi_r(a,b) = \begin{cases} 1, & (a,b) \in \zeta_r, \\ 0, & (a,b) \notin \zeta_r. \end{cases}$$

Setting

$$F_0 = (1 - \psi_0)F, \quad F_1 = \psi_0 (1 - \psi_1) F, \quad \ldots, \quad F_{R-1} = \psi_0 \cdots \psi_{R-2} F,$$

we have

$$F = F_0 + \psi_0 F = F_0 + F_1 + \psi_0 \psi_1 F = F_0 + F_1 + \cdots + F_{R-1},$$

where $F_r$ vanishes on $\zeta_r$. Since $\zeta_r$ is the zero set of $G_r$, we can write $F_r = G_r P_r$, $P_r \in L(M,K)$, and $f$ is in the linear span of $\mathbf{g}$, proving the next result.

Theorem 8 If $\mathbf{g}$ is the disjoint union of critically sampled W-H systems $\mathbf{g}_r = (g_r, M, K)$, $0 \le r < R$, and $\zeta_r$ is the zero set of $G_r = Z_0(K)g_r$, $0 \le r < R$, then the dimension of the linear span of $\mathbf{g}$ is $N - J$ where $J$ is the order of $\zeta = \bigcap_{r=0}^{R-1} \zeta_r$. A function $f \in L(N)$ is in the linear span of $\mathbf{g}$ if and only if $F = Z_0(K)f$ vanishes on $\zeta$.
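The indicator-function construction behind theorem 8 is easy to express with boolean masks. The sketch below is an illustration only; the function name `decompose` and the tolerance `tol` used to decide where a $G_r$ vanishes are assumptions. Each part is $F_r = \psi_0 \cdots \psi_{r-1}(1 - \psi_r)F$, so the parts sum to $F$ whenever $F$ vanishes on the common zero set.

```python
import numpy as np

def decompose(F, Gs, tol=1e-12):
    """Split F into parts F_r such that F_r vanishes on the zero set of Gs[r].

    F  : array of shape (M, K), assumed to vanish on the common zero set.
    Gs : list of arrays of shape (M, K), the Zak transforms G_r.
    """
    F = np.asarray(F)
    parts = []
    remaining = np.ones(F.shape, dtype=bool)     # where psi_0 * ... * psi_{r-1} = 1
    for G in Gs:
        psi = np.abs(np.asarray(G)) < tol        # indicator of the zero set of G_r
        parts.append(np.where(remaining & ~psi, F, 0))
        remaining &= psi                         # keep only points still unassigned
    return parts
```

Every point outside the common zero set is assigned to exactly one part (the first $r$ for which $G_r$ does not vanish there), so the quotients $F_r / G_r$ are well defined wherever $F_r$ is non-zero.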


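If we set

$$G(a,b) = \begin{bmatrix} G_0(a,b) & \cdots & G_{R-1}(a,b) \end{bmatrix},$$

then we can write

$$F(a,b) = G(a,b) \begin{bmatrix} P_0(a,b) \\ \vdots \\ P_{R-1}(a,b) \end{bmatrix}.$$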

Choose $f \in L(N)$ in the linear span of $\mathbf{g}$. An algorithm for computing a W-H expansion of $f$ over $\mathbf{g}$ is given as follows.

Decompose $F = Z_0(K)f$,

$$F = \sum_{r=0}^{R-1} F_r, \quad F_r \in L(M,K),$$

where $F_r$ vanishes on the zero set of $G_r$, $0 \le r < R$.

Compute the collection of 2D $M \times K$ FT of

$$\frac{F_r(a,b)}{G_r(a,b)}, \quad 0 \le r < R.$$

This stage is understood to be taken as in the critically sampled case with arbitrary values assigned to the quotient at points where the functions $G_r$, $0 \le r < R$, vanish.

If we assume that $T \log T$ computations are needed for the $T$-point FT then the complexity of one W-H expansion computation is

$$N \log K + R(N \log K + N \log M) + RN \tag{22}$$

but advantage can be taken of the large number of zero data values.

The coefficient set of W-H expansions of $f \in L(N)$ over $\mathbf{g}$ is parameterized by the collection of decompositions of $F$ and by the arbitrarily assigned values to the quotients at the points of $\zeta_r$, $0 \le r < R$.

D.5  Rationally  Oversampled 

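Denote the least common multiple of $M$ and $M'$ by $\bar{M}$ and set $\bar{M} = MS = M'S'$. Then $S$ divides $K$ and $N = \bar{M}\,\frac{K}{S}$.

Theorem 9 The rationally oversampled W-H system $\mathbf{g} = (g, M', K)$ is the disjoint union of the undersampled W-H systems

$$\mathbf{g}_{s'} = (g_{s'}, \bar{M}, K), \quad g_{s'} = g_{s'M',0}, \quad 0 \le s' < S'.$$

Proof We can write $0 \le m' < K'$ uniquely in the form

$$m' = s' + mS', \quad 0 \le s' < S',\ 0 \le m < \frac{K'}{S'}.$$

The theorem follows from

$$g_{m'M',n'K} = (g_{s'M',0})_{m\bar{M},n'K}, \quad 0 \le n' < M.$$

Consider the undersampled W-H system $(g, \bar{M}, K)$ and set $G = Z_0(K)g$. Since

$$Z(K)g_{m\bar{M},n'K}(a,b) = e^{-2\pi i (an'/M + bmS/K)}\, G(a,b), \quad 0 \le a < M,\ 0 \le b < K, \tag{23}$$

$f \in L(N)$ has a W-H expansion over $(g, \bar{M}, K)$ if and only if $F = Z_0(K)f$ can be written as $F = GP$ where $P \in L(M,K)$ satisfies

$$P\left(a, b + \frac{K}{S}\right) = P(a,b), \quad 0 \le a < M,\ 0 \le b < K - \frac{K}{S}.$$

For the rationally oversampled W-H system $\mathbf{g} = (g, M', K)$, set $G_{s'} = Z_0(K)g_{s'}$, where $g_{s'} = g_{s'M',0}$. By theorem 9 and the preceding discussion we have the following result.

Theorem 10 A function $f \in L(N)$ is in the linear span of $\mathbf{g}$ if and only if $F = Z_0(K)f$ has the form

$$F = \sum_{s'=0}^{S'-1} G_{s'} P_{s'}, \tag{24}$$

where $P_{s'} \in L(M,K)$ satisfies

$$P_{s'}\left(a, b + \frac{K}{S}\right) = P_{s'}(a,b), \quad 0 \le s' < S',\ 0 \le a < M,\ 0 \le b < K - \frac{K}{S}. \tag{25}$$

A collection of W-H expansion coefficients of $f$ over $\mathbf{g}$ is given by the collection of 2D $M \times \frac{K}{S}$ FT of

$$P_{s'}, \quad 0 \le s' < S'.$$

For each $0 \le a < M$, $0 \le b < \frac{K}{S}$, define $\mathbf{F}(a,b) \in \mathbb{C}^{S}$ by

$$\mathbf{F}(a,b) = \left[ F\left(a, b + s\frac{K}{S}\right) \right]_{0 \le s < S}$$

and the $S \times S'$ matrix $\mathbf{G}(a,b)$ by

$$\mathbf{G}(a,b) = \left[ G_{s'}\left(a, b + s\frac{K}{S}\right) \right]_{0 \le s < S,\ 0 \le s' < S'}.$$

By theorem 10, $f$ is in the linear span of $\mathbf{g}$ if and only if for each $0 \le a < M$, $0 \le b < \frac{K}{S}$, there exists $\mathbf{P}(a,b) \in \mathbb{C}^{S'}$ such that

$$\mathbf{F}(a,b) = \mathbf{G}(a,b)\mathbf{P}(a,b), \quad 0 \le a < M,\ 0 \le b < \frac{K}{S}.$$

Denote by $r(a,b)$ the rank of $\mathbf{G}(a,b)$. The dimension of the linear span of $\mathbf{g}$ is

$$\sum_{0 \le a < M,\ 0 \le b < K/S} r(a,b). \tag{26}$$

In particular, if

$$r(a,b) = S, \quad 0 \le a < M,\ 0 \le b < \frac{K}{S}, \tag{27}$$

then $\mathbf{g}$ is complete and every $f \in L(N)$ has a W-H expansion over $\mathbf{g}$.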

There are several linear algebra techniques and programming packages that can be applied to characterize the linear span of $\mathbf{g}$ and to compute W-H expansion coefficients for $f \in L(N)$ in this linear span. Gauss elimination is perhaps the best known technique but QR-decompositions or singular value decompositions (SVD) of $\mathbf{G}(a,b)$ are more suited to applications which subject W-H expansion coefficients to least-square constraints. We will briefly review and introduce notation for SVD at this time.

For each $(a,b)$, $0 \le a < M$, $0 \le b < \frac{K}{S}$, the singular value decomposition of $\mathbf{G}(a,b)$ has the form

$$\mathbf{G}(a,b) = \mathbf{U}(a,b)\, \mathbf{\Sigma}(a,b)\, \mathbf{V}(a,b)$$

where $\mathbf{U}(a,b)$ is a unitary $S \times S$ matrix, $\mathbf{V}(a,b)$ is a unitary $S' \times S'$ matrix and $\mathbf{\Sigma}(a,b)$ is a 'diagonal' $S \times S'$ matrix

$$\mathbf{\Sigma}(a,b) = \begin{bmatrix} \sigma_0(a,b) & & & \\ & \sigma_1(a,b) & & \\ & & \ddots & \\ & & & \sigma_{S-1}(a,b) \end{bmatrix}.$$

Denote the $s$-th column of $\mathbf{U}(a,b)$ by $\mathbf{u}_s(a,b)$.

Theorem 11 A function $f \in L(N)$ is in the linear span of $\mathbf{g}$ if and only if for every $(a,b)$, $0 \le a < M$, $0 \le b < \frac{K}{S}$, $\mathbf{F}(a,b)$ is in the linear span of

$$\{ \sigma_s(a,b)\, \mathbf{u}_s(a,b) : 0 \le s < S \}.$$

For $f$ in the linear span of $\mathbf{g}$ we can solve for $\mathbf{P}(a,b)$ by introducing the pseudo-inverse of $\mathbf{G}(a,b)$

$$\mathbf{G}^{+}(a,b) = \mathbf{V}^{-1}(a,b)\, \mathbf{\Sigma}^{+}(a,b)\, \mathbf{U}^{-1}(a,b),$$

where $\mathbf{\Sigma}^{+}(a,b)$ is the $S' \times S$ diagonal matrix with diagonal entries

$$\sigma_s^{+}(a,b) = \begin{cases} \dfrac{1}{\sigma_s(a,b)}, & \sigma_s(a,b) \ne 0, \\ 0, & \sigma_s(a,b) = 0, \end{cases} \quad 0 \le a < M,\ 0 \le b < \frac{K}{S}.$$

Then

$$\mathbf{P}(a,b) = \mathbf{G}^{+}(a,b)\, \mathbf{F}(a,b), \quad 0 \le a < M,\ 0 \le b < \frac{K}{S}.$$
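The pseudo-inverse step can be illustrated for a single $S \times S'$ matrix. The following is a sketch using NumPy rather than the routine used in the reported implementation; the function name `pinv_svd` and the threshold `tol` for treating a singular value as zero are assumptions. Note that `np.linalg.svd` returns the factorization as `U`, the singular values, and `Vh`, with `Vh` playing the role of $\mathbf{V}$ in the text.

```python
import numpy as np

def pinv_svd(G, tol=1e-12):
    """Pseudo-inverse G+ = V^{-1} Sigma+ U^{-1} of an S x S' matrix via the SVD."""
    U, s, Vh = np.linalg.svd(np.asarray(G, dtype=complex), full_matrices=False)
    # Invert the non-zero singular values and leave the (numerically) zero ones at 0.
    s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])
    # U and Vh have orthonormal columns/rows here, so their inverses on the
    # relevant subspaces are the conjugate transposes.
    return Vh.conj().T @ np.diag(s_plus) @ U.conj().T

# Usage: solve F = G P for one point (a, b) when F lies in the column span of G.
G = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])          # S = 2, S' = 3
F = G @ np.array([1.0, -1.0, 2.0])
P = pinv_svd(G) @ F                      # minimum-norm solution of G P = F
```

Among all solutions of $\mathbf{F} = \mathbf{G}\mathbf{P}$, the pseudo-inverse picks the one of least norm, which is why it suits least-square constraints on the coefficients.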
The multiplicative complexity of the computation is

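$$N(\log K + S + 1) + \frac{NS'}{S}\left( \log\frac{K}{S} + \log M + S' \right)$$

where $N \log N$ is the complexity of the $N$-point FT and $S^2$ is the complexity of the action of an $S \times S$ matrix on a vector.

Flow diagram, integer oversampling: $\mathbf{g} = (g, M', K)$, $M = RM'$, $R \in \mathbb{Z}$.

$f \to$ finite Zak transform $\to F = \sum_r F_r \to P_r(a,b) \to$ 2D $M \times K$ FT $\to c_r(m'M, n'K)$, $0 \le m' < K$, $0 \le n' < M$

$$c(rM' + m'M, n'K) = c_r(m'M, n'K)$$

$$f = \sum_{m'=0}^{K'-1} \sum_{n'=0}^{M-1} c(m'M', n'K)\, g_{m'M',n'K}$$

Flow diagram, rational oversampling: $\mathbf{g} = (g, M', K)$, $M = RM'$, $R \in \mathbb{Q}$, $R \notin \mathbb{Z}$.

$f \to$ finite Zak transform $\to F \to \mathbf{F}(a,b) \in \mathbb{C}^{S} \to \mathbf{G}^{+}(a,b) \to \mathbf{P}(a,b) \in \mathbb{C}^{S'} \to$ 2D $M \times \frac{K}{S}$ FT $\to c_{s'}(m, n')$

$$P_{s'}(a,b) = \sum_{m=0}^{K/S-1} \sum_{n'=0}^{M-1} c_{s'}(m, n')\, e^{-2\pi i (an'/M + bmS/K)}$$

$$c(m', n') = c_{s'}(m, n'), \quad m' = s' + mS',\ 0 \le m' < K',\ 0 \le n' < M.$$

$$f = \sum_{m'=0}^{K'-1} \sum_{n'=0}^{M-1} c(m'M', n'K)\, g_{m'M',n'K}$$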

D.6  Implementation  Results 

In this section we describe implementation issues and present timing results for the implementation of the algorithms presented in the previous sections. Implementations on a single Intel i860 RISC microprocessor as well as on the Paragon multi-processor parallel platform are reported.

D.6.1  Critical  sampling  (C.S.) 

We  have  tested  three  basic  analysis  functions: 

•  Gaussian  function 

When K and M are both even integers, the FZT of the Gaussian window function has a zero at (K/2, M/2). Set Q(K/2, M/2) = 0.0; the total energy of the Gabor coefficients will then be minimum.

When either K or M is an odd integer, or both of them are odd integers, the FZT of the Gaussian window function has no zeros.

•  Rectangular function

A small rectangular window results in an FZT with no zeros. For example, for N = K x M = 1200, a window of width 90 centered at 600 has no zeros in Zak space.

A rectangular window of width 150 centered at 600 has zeros in Zak space located at (j,8), (j,16), (j,24), (j,32), where j = 0 to 39.

•  Triangular function

When either K or M is an odd integer, or both of them are odd integers, there are no zeros in Zak space.

A relatively small triangular window results in a single zero at the center of Zak space. For example, for N = 40 x 30 = 1200, a window of 61 non-zero values centered at 600 has one zero in Zak space at (20, 15).

We have implemented the computation for the critical sampling case; the main program is in FORTRAN and the FFT modules are fine-tuned i860 assembly with mixed sizes. Timing results are given in tables 1 and 2.

Complexity 

For a real input signal f, the FZT of f is Hermitian symmetric along the K dimension. If the analysis signal is also real, then the 2-D M x K array Q(a, b) has the same symmetry. The inverses of the FZT of g(a, b) are pre-computed and stored in memory. The complexity of the computation is as follows (F(n) denotes the complexity of the n-point FFT):

Step                                   Operations
Z(K)f (FZT of f)                       M x real F(K)
Z(K)f / Z(K)g                          K/2 x M multiplications
2-D FT of Q (Herm. symm. along K)      M x Herm. F(K), K x real F(M)

Size N     2-D K x M     Time
256        16 x 16       0.67
512        16 x 32       1.20
1024       32 x 32       2.02
2048       32 x 64       3.98
4096       64 x 64       7.41
8192       64 x 128      14.96
16384      128 x 128     29.82
32768      128 x 256     60.89
65536      256 x 256     125.55
131072     256 x 512     264.60
262144     512 x 512     566.99

Table 1: Timing Results (in milliseconds) on the Intel i860 RISC microprocessor (Critical Sampling, sizes 2^n)

D.7  Integer  Oversampling 

We choose the decomposition $F = Z(K)f = \sum_{r=0}^{R-1} F_r$ such that $F_1, \ldots, F_{R-1}$ each has only one non-zero point, so that the computation of the 2D FT of $Q_1(a,b), \ldots, Q_{R-1}(a,b)$ is trivial. The codes are similar to the critically sampled case, with data rearrangement at the end.

Size N     2-D K x M      Time
384        8 x 48         1.47
768        16 x 48        1.99
1536       32 x 48        3.12
3072       64 x 48        5.91
3072       128 x 24       6.15
6144       128 x 48       12.07
6144       64 x 96        12.48
12288      512 x 24       26.07
12288      128 x 96       24.05
24576      256 x 96       48.70
49152      256 x 192      98.71
98304      256 x 384      203.52
98304      512 x 192      209.12
196608     512 x 384      433.41
393216     1024 x 384     1011.61

Table 2: Timing Results (in milliseconds) on the Intel i860 RISC microprocessor (Critical Sampling, mixed sizes)

D.7.1  Rational  oversampling 

In [9], the authors point out that, for a Gaussian window function, oversampling by more than 20 percent (5/4) does not have significant influence. We have implemented the computation for oversampling rates 3/2 and 5/4. Again, the main routine is coded in FORTRAN, and the DFT routines are fine-tuned i860 assembly codes for mixed sizes. For the complex singular value decomposition (SVD) we used the LINPACK routine. We have tested three basis functions:

•  Gaussian basis function

Rational oversampling rates of 3/2 and 5/4 were tested. If rank(G(a, b)) equals 2 or 4, respectively, then g is complete and every f has a W-H expansion over g.

•  Rectangular basis function

Rational oversampling by 3/2 and 5/4 was tested. The rectangular window size has to be chosen such that it is not a factor of K along the K-dimension for every f to be expandable in the W-H system.

•  Triangular basis function

An example of size N = 40 x 30 = 1200 has been tested with rational oversampling by 3/2. The experimental results are:

A window of size 101 centered at 600 results in an expandable W-H system.

A window of size 151 centered at 600 results in an expandable W-H system.

A window of size 201 results in point (20,10) being a zero singular value in Zak transform space.

Complexity 

In the case of real input and real analysis signals the FZT is Hermitian symmetric along the K dimension. We can show that the S' 2-D M x K/S arrays P_s'(a, b) have Hermitian symmetry along the K/S dimension. The complexity of real-time computation is:

Step                                          Operations
FZT of f                                      M x real F(K)
G+(a, b)F(a, b)                               M x K/S products of an S' x S matrix with a vector of length S
S' 2-D FT of P_s' with Hermitian              S' x M x Hermitian F(K/S),
symmetry along K/S                            S' x K/S x real F(M)

Timing results of various sizes are given in the following tables.


D.8  Parallel  Implementation 

Assume that a distributed memory parallel computer has p (≤ min(K, M)) processors. Set

$$p = K/K_1 = M/K_2. \tag{28}$$

Size N     2-D K x M     Time
384        16 x 24       2.06
768        32 x 24       2.97
1536       64 x 24       5.31
3072       64 x 48       10.79
3072       128 x 24      10.05
6144       128 x 48      20.85
6144       64 x 96       22.86
12288      128 x 96      43.15
24576      256 x 96      84.71
49152      256 x 192     171.39
98304      256 x 384     412.12
98304      512 x 192     413.50
196608     512 x 384     840.02

Table 3: Timing Results (in milliseconds) on the Intel i860 RISC microprocessor (Rational Oversampling (3/2))

The algorithms described in sections 3, 4 and 5 possess highly parallel structure. They are particularly suitable for a distributed memory multiprocessor system. For example, in the critically sampled case, the algorithm can be implemented as follows.

•  Each processor receives $K_1$ $K$-point input data

•  Compute $K_1$ $K$-point real FFT

•  Point-wise multiplication by the pre-calculated reciprocal of the Zak transform of the basis function, $1/Z(K)g(a,b)$

•  Compute $K_1$ $K$-point Hermitian FFT

•  Data permutation between processors (matrix transpose)

•  Compute $K_2$ $M$-point real FFT

Implementation of the integer oversampled case has a structure similar to that of the critically sampled case, and the rationally oversampled case has a better parallel structure, since it has S relatively small 2-dimensional K/S x M FFTs, which might be carried out locally in each processor without interprocessor data permutation. Timing results of critical sampling on the Intel 4-node and 8-node Paragon are given in tables 5 and 6. The parallel flow diagram is shown in Fig. 3.

Size N     2-D K x M     Time
320        8 x 40        2.82
640        16 x 40       3.85
1280       32 x 40       5.66
2560       64 x 40       9.65
5120       128 x 40      16.42
5120       64 x 80       18.32
10240      128 x 80      32.09
10240      64 x 160      37.99
20480      128 x 160     67.65
40960      128 x 320     134.08
81920      256 x 320     258.40
163840     512 x 320     522.19
327680     512 x 640     1149.76

Table 4: Timing Results (in milliseconds) on the Intel i860 microprocessor (Rational Oversampling (5/4))

Fig. 3. Parallel implementation flow diagram (figure not reproduced; surviving labels: c(a,0), c(a,1), ..., c(a,K-1), with 0 ≤ r ≤ K-1, 0 ≤ a ≤ M-1)


D.9  Conclusions 

Algorithms for the computation of Weyl-Heisenberg (W-H) coefficients for the cases of critical sampling, integer oversampling and rational oversampling have been presented, and easily computable conditions for the existence of W-H expansions have been derived in terms of the Zak transform of the signal and the analysis function. We have shown that the algorithms described lead to very efficient FFT based implementations for single DSP processor systems as well as for parallel multi-processor configurations.

Size N      2-D K x M       Time
16384       128 x 128       10.06
32768       128 x 256       19.66
65536       256 x 256       39.31
131072      256 x 512       80.24
262144      512 x 512       163.10
524288      512 x 1024      368.99
1048576     1024 x 1024     801.82
2097152     1024 x 2048     1661.96

Table 5: Timing Results (in milliseconds) on the Intel Paragon (4-nodes)

Size N      2-D K x M       Time
65536       256 x 256       22.18
131072      256 x 512       42.45
262144      512 x 512       86.32
524288      512 x 1024      189.54
1048576     1024 x 1024     404.32
2097152     1024 x 2048     840.17
8388608     2048 x 2048     1716.03

Table 6: Timing Results (in milliseconds) on the Intel Paragon (8-nodes)
E  Group  Invariant  Fourier  Transform  Algorithms


E.1  Introduction

The  design  of  algorithms  for  computing  the  crystallographic  Fourier  transform  is  a  subject  in 
applied  group  theory.  In  previous  works  [2,  19]  we  exploited  several  elementary  results  in 
finite  abelian  group  theory  and  developed  the  basic  abstract  constructs  underlying  the  class 
of  divide  and  conquer  algorithms  for  computing  the  multidimensional  (MD)  discrete  Fourier 
transform  (DFT).  This  setting  provides  a  convenient  landscape  for  introducing  a  class  of  divide 
and conquer crystallographic algorithms. In [2] we outlined a systematic approach for classifying 3-dimensional (3D) crystallographic groups. Applications to 3D crystallography require
a  detailed  understanding  of  this  classification.  Similar  classifications  exist  to  some  extent  in 
higher  dimensions  and  are  equally  important  for  applications  to  quasicrystallography. 

The  theory  developed  in  this  work  will  operate  within  the  abstract  formulation  presented  in 
[2,  19].  Finite  abelian  groups  will  serve  as  data  indexing  sets.  A  class  of  affine  group  fast  Fourier 
transform  (FFT)  algorithms  will  be  introduced  which  fully  utilize  data  invariance  with  respect 
to  subgroups  of  the  affine  group  of  data  indexing  sets.  The  affine  subgroup  need  not  come  from  a 
crystallographic  group.  This  approach  removes  dimension,  transform  size  and  crystallographic 
group  from  algorithm  design  and  serves  to  bring  out  fundamental  algorithmic  procedures  rather 
than  produce  an  explicit  algorithm.  These  procedures  provide  tools  for  writing  code  which 
scales  over  dimension,  transform  size  and  crystallographic  group  and  which  can  be  targeted  to 
various  architectures.  In  fact  these  methods  apply  to  all  230  3D  crystallographic  groups  and  to 
composite  transform  sizes.  We  will  show  the  power  of  these  tools  by  way  of  an  extensive  list  of 
implementation  examples. 

We  distinguish  three  algorithmic  strategies.  The  first  is  based  on  the  well-known  Good- 
Thomas  (GT)  or  prime  factor  algorithm  which  breaks  up  a  FT  computation  into  a  sequence 
of  smaller  size  DFT  computations  determined  by  the  relatively  prime  factors  of  the  initial 
transform  sizes.  In  [2]  we  developed  an  abstract  formulation  of  the  GT  and  applied  it  as  a 
tool for crystallographic algorithms. Our treatment here will be brief and mostly contained in examples.

Reduced transform (RT) algorithms were considered in detail in [2, 19]. A simple generalization of the RT approach based on collections of subgroups will be presented, which provides a universal framework for affine group Fourier transform (FT) algorithms. In applications to 3D crystallography this class of algorithms replaces the problem of computing the FT of 3D group invariant data by that of computing in parallel the FT of a collection of 1D or 2D group invariant data sets. The latter problem is substantially simpler and several efficient implementations are widely practiced. (See appendix).

A  third  approach,  based  on  a  generalization  of  Cooley-Tukey  fast  FT  (CT  FFT),  will  be 
discussed  which  performs  generalized  periodizations  [19]  with  respect  to  affine  subgroups.  This 
method  applies  to  abelian  affine  subgroup  invariant  data  and  hence  to  about  100  of  the  230 
3D crystallographic groups. A CT FFT algorithm associated with an abelian subgroup X of the affine group provides code for Y invariant data with respect to every subgroup Y of X.
In  applications,  we  choose  X  such  that  the  associated  CT  FFT  is  easy  to  code  and  efficient 
and  such  that  X  contains  a  large  collection  of  subgroups  Y  of  interest.  X  itself  need  not  be  a 
crystallographic  group.  An  example  will  be  provided  which  shows  how  one  code  applies  to  71 
of  the  crystallographic  groups. 

This  work  is  organized  as  follows.  In  chapter  II,  we  will  review  all  the  necessary  group  theory. 
Finite  abelian  group  theory  will  be  briefly  considered  as  it  is  covered  in  many  elementary  texts. 
We  reference  [19]  as  it  contains  all  the  necessary  results.  The  affine  group  of  a  finite  abelian 
group  will  be  defined.  Constructs  related  to  the  action  of  affine  subgroups  on  data  indexing 
sets  will  be  introduced.  In  chapter  III  we  define  the  Fourier  transform  of  an  abelian  group  and 
study  its  fundamental  role  in  interchanging  periodization  and  decimation  operations  (duality). 
The  RT,  CT  FFT  and  GT  algorithms  are  presented  in  chapter  IV  as  applications  of  this  duality 
to  different  global  decomposition  strategies. 

Affine group FFT algorithms based on the RT algorithm are discussed in chapter VI, while those coming from the application of the affine group CT FFT are introduced in chapter VIII. In chapter IX, we briefly sketch a method of incorporating 1D symmetry into FFT computations, which calls on lower order existing FFT routines using the symmetry condition.

Throughout this work, we will provide many examples. These examples have been chosen to reflect both the theory and several years of experience, our own and that of others, in writing code for the 3D crystallographic FT.

E.2  Group  Theory 

E.2.1  Finite  abelian  group 

Denote by $\mathbb{Z}/N$ the group of integers modulo $N$ consisting of the set

$$\{0, 1, \ldots, N-1\}$$

with addition taken modulo $N$. $\mathbb{Z}/N$ is a cyclic group of order $N$ and every cyclic group of order $N$ is isomorphic to $\mathbb{Z}/N$. For example, the multiplicative group $U_N$ of complex $N$th roots of unity

$$U_N = \{ w^n : 0 \le n < N \}, \quad w = e^{2\pi i/N},$$

is a cyclic group of order $N$ and the mapping

$$\omega : \mathbb{Z}/N \to U_N$$

defined by $\omega(n) = w^n$, $0 \le n < N$, is a group isomorphism from $\mathbb{Z}/N$ onto $U_N$.

The direct product of two finite abelian groups

$$A_1 \times A_2$$

is the set of all pairs $(a_1, a_2)$, $a_1 \in A_1$, $a_2 \in A_2$, with componentwise addition. By the fundamental theorem of finite abelian groups, every finite abelian group $A$ is isomorphic to a direct product of cyclic groups,

$$A \cong \mathbb{Z}/N_1 \times \cdots \times \mathbb{Z}/N_R. \tag{1}$$

We call Eq. (1) a presentation of $A$. A finite abelian group can have several presentations which vary as to the number of cyclic group factors as well as the orders of the cyclic groups. For example

$$\mathbb{Z}/30 \cong \mathbb{Z}/2 \times \mathbb{Z}/15 \cong \mathbb{Z}/3 \times \mathbb{Z}/10 \cong \mathbb{Z}/5 \times \mathbb{Z}/6 \cong \mathbb{Z}/2 \times \mathbb{Z}/3 \times \mathbb{Z}/5.$$

In general, we have

Theorem E.1 The direct product of cyclic groups having relatively prime orders is a cyclic group.

Theorem E.1 is a special case of the Chinese remainder theorem (CRT).

Theorem E.2 Chinese Remainder Theorem

Let $N = N_1 N_2 \cdots N_R$ be a factorization of $N$ into pairwise relatively prime integers. Then there exist uniquely determined integers

$$0 \le e_1, e_2, \ldots, e_R < N$$

satisfying

$$e_r \equiv 1 \bmod N_r,$$

$$e_r \equiv 0 \bmod N_s, \quad 1 \le r, s \le R,\ r \ne s.$$

The set $\{e_1, e_2, \ldots, e_R\}$ is called the complete system of idempotents for the factorization $N = N_1 N_2 \cdots N_R$.

Let $\{e_1, e_2, \ldots, e_R\}$ be the complete system of idempotents for the factorization $N = N_1 N_2 \cdots N_R$. By CRT

$$e_r^2 \equiv e_r \bmod N, \tag{2}$$

$$e_r e_s \equiv 0 \bmod N, \quad 1 \le r, s \le R,\ r \ne s, \tag{3}$$

$$\sum_{r=1}^{R} e_r \equiv 1 \bmod N. \tag{4}$$

It follows that every $n \in \mathbb{Z}/N$ has a unique expansion of the form

$$n \equiv n_1 e_1 + n_2 e_2 + \cdots + n_R e_R \bmod N, \quad n_r \in \mathbb{Z}/N_r.$$

In fact

$$n_r \equiv n \bmod N_r, \quad 1 \le r \le R.$$

CRT shows that the mapping

$$\chi : \mathbb{Z}/N \to \mathbb{Z}/N_1 \times \mathbb{Z}/N_2 \times \cdots \times \mathbb{Z}/N_R$$

defined by

$$\chi(n) = (n_1, n_2, \ldots, n_R), \quad n_r \equiv n \bmod N_r,\ 1 \le r \le R, \tag{5}$$

is an isomorphism having inverse

$$\chi^{-1}(n_1, n_2, \ldots, n_R) \equiv n_1 e_1 + n_2 e_2 + \cdots + n_R e_R \bmod N. \tag{6}$$

CRT is the basis for many theoretical and applied results in algorithm design. It is a major tool for interchanging between 1D and MD arrays, which is the core of the GT algorithm. The use of idempotents in describing this interchange is most important in implementation [19].
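The role of the idempotents in passing between 1D and MD indexing can be made concrete with a short sketch. It is an illustration, not code from [19]: the function names are hypothetical, and it relies on Python 3.8 or later for `math.prod` and for modular inverses via `pow(x, -1, m)`.

```python
from math import prod

def idempotents(factors):
    """Complete system of idempotents e_r for N = N_1 * ... * N_R (pairwise coprime)."""
    N = prod(factors)
    es = []
    for Nr in factors:
        cofactor = N // Nr                    # divisible by every N_s with s != r
        inverse = pow(cofactor, -1, Nr)       # cofactor * inverse = 1 mod N_r
        es.append((cofactor * inverse) % N)
    return es

def to_md(n, factors):
    """1D index n -> MD index (n mod N_1, ..., n mod N_R)."""
    return tuple(n % Nr for Nr in factors)

def from_md(index, factors):
    """MD index (n_1, ..., n_R) -> 1D index n_1*e_1 + ... + n_R*e_R mod N."""
    N = prod(factors)
    return sum(nr * er for nr, er in zip(index, idempotents(factors))) % N

print(idempotents([2, 3, 5]))   # [15, 10, 6]
```

For $N = 30 = 2 \cdot 3 \cdot 5$ the idempotents are 15, 10 and 6; they sum to 31, which is 1 modulo 30, and `from_md(to_md(n, [2, 3, 5]), [2, 3, 5])` returns `n` for every `n` in the range 0 to 29.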

CRT can be used to derive the primary factorization of a finite abelian group. Suppose $A$ is a finite abelian group of order $N$, and we write

$$N = p_1^{a_1} p_2^{a_2} \cdots p_M^{a_M}, \tag{7}$$

where $p_1, p_2, \ldots, p_M$ are distinct primes. Choose any presentation of $A$

$$A \cong \mathbb{Z}/N_1 \times \cdots \times \mathbb{Z}/N_R, \quad N = N_1 \cdots N_R, \tag{8}$$

and write

$$N_r = p_1^{a_1(r)} \cdots p_M^{a_M(r)}, \quad a_m(r) \ge 0,\ 1 \le m \le M.$$

Then

$$\mathbb{Z}/N_r \cong \mathbb{Z}/p_1^{a_1(r)} \times \cdots \times \mathbb{Z}/p_M^{a_M(r)},$$

and we have, by rearranging factors, the primary factorization of $A$

$$A \cong A_1 \times \cdots \times A_M,$$

where

$$A_m = \mathbb{Z}/p_m^{a_m(1)} \times \cdots \times \mathbb{Z}/p_m^{a_m(R)}. \tag{9}$$

The primary factorization of $A$ is unique as the factors $A_m$ can be described as the set of all elements in $A$ having order a power of the prime $p_m$.

E.2.2  Character  group 

Consider a finite abelian group $A$ of order $N$. The character group $A^*$ of $A$ is the set of all group homomorphisms

$$a^* : A \to U_N$$

with group addition defined by

$$(a^* + b^*)(a) = a^*(a)\, b^*(a), \quad a^*, b^* \in A^*,\ a \in A. \tag{10}$$

The character group $A^*$ is the natural indexing set for FT as we can view $A$ as the time parameter space and $A^*$ as the frequency parameter space.

We will usually write $a^*(a)$ as $\langle a, a^* \rangle$.

The mapping

$$\phi : \mathbb{Z}/N \to (\mathbb{Z}/N)^*$$

defined by

$$\langle m, \phi(n) \rangle = e^{2\pi i mn/N}, \quad 0 \le n, m < N,$$

establishes an isomorphism

$$\mathbb{Z}/N \cong (\mathbb{Z}/N)^*.$$

More generally, the mapping

$$\phi : \mathbb{Z}/N_1 \times \cdots \times \mathbb{Z}/N_R \to (\mathbb{Z}/N_1 \times \cdots \times \mathbb{Z}/N_R)^*$$

defined by

$$\langle (m_1, \ldots, m_R), \phi(n_1, \ldots, n_R) \rangle = e^{2\pi i m_1 n_1/N_1} \cdots e^{2\pi i m_R n_R/N_R} \tag{11}$$

establishes an isomorphism

$$\mathbb{Z}/N_1 \times \cdots \times \mathbb{Z}/N_R \cong (\mathbb{Z}/N_1 \times \cdots \times \mathbb{Z}/N_R)^*.$$

By the fundamental theorem, every finite abelian group $A$ is isomorphic to its character group $A^*$.


Duality 

Fix an isomorphism $\phi$ from A onto A*. The dual of a subgroup B of A is defined by

    $B^\perp = \{a \in A : \langle b, \phi(a) \rangle = 1 \text{ for all } b \in B\}.$  (12)

Since $\phi$ is an isomorphism,

    $\phi(B^\perp) = \{\phi(b^\perp) : b^\perp \in B^\perp\}$

is the subgroup of all characters of A that act trivially on B.

Consider the quotient group A/B of B-cosets

    $a + B = \{a + b : b \in B\}$

with abelian group addition

    $(a + B) + (a' + B) = (a + a') + B.$

The isomorphism $\phi$ induces isomorphisms

    $\phi_1 : B^\perp \to (A/B)^*, \quad \phi_2 : A/B^\perp \to B^*,$

by the formulas

    $\langle a + B, \phi_1(b^\perp) \rangle = \langle a, \phi(b^\perp) \rangle, \quad a \in A, \ b^\perp \in B^\perp,$  (13)

    $\langle b, \phi_2(a + B^\perp) \rangle = \langle b, \phi(a) \rangle, \quad a \in A, \ b \in B.$  (14)

The characterization of $\phi(B^\perp)$ given above implies both induced isomorphisms are well defined,
i.e., independent of coset representation.

The induced isomorphisms $\phi_1$ and $\phi_2$ play fundamental roles in the description of divide
and conquer FT algorithms.
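
The dual of a subgroup can be computed directly from definition (12). The sketch below does this for A = Z/N with the pairing $e^{2\pi i\, mn/N}$; it uses the fact that the pairing equals 1 exactly when mn is divisible by N, so no floating point is needed. The function name is hypothetical and chosen for this illustration.

```python
def dual_subgroup(B, N):
    """Return the dual of the subgroup B of Z/N.

    With the pairing <b, phi(a)> = exp(2*pi*i*a*b/N), the condition
    <b, phi(a)> = 1 is equivalent to a*b = 0 mod N.
    """
    return [a for a in range(N) if all((a * b) % N == 0 for b in B)]
```

For N = 12 and B = {0, 4, 8}, `dual_subgroup([0, 4, 8], 12)` returns `[0, 3, 6, 9]`, and applying the function again recovers B.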
The vector space L(X).

Denote the space of all complex valued functions on a finite set X by L(X). L(X) is a vector
space over C with addition and scalar multiplication defined by

    $(f + g)(x) = f(x) + g(x), \quad (\alpha f)(x) = \alpha(f(x)), \quad \alpha \in C, \ f, g \in L(X), \ x \in X.$

Consider a finite abelian group A and a subgroup B of A. For $f \in L(A)$ define

    $Per_B f(a) = \sum_{b \in B} f(a + b)$  (15)

and

    $Dec_B f(a) = f(a)$ if $a \in B$, and $0$ otherwise.  (16)

The periodization operator $Per_B$ and the decimation operator $Dec_B$ are fundamental operators
on L(A).

Suppose A has order N. L(A) has dimension N. The evaluation basis of L(A) is the
collection of functions

    $\{e_a : a \in A\}$

defined by

    $e_a(b) = 1$ if $b = a$, and $0$ if $b \ne a$, $\quad b \in A.$  (17)

We will denote the evaluation basis by A.

The character basis of L(A) is the collection A* of characters of A. Relative to the inner
product on L(A) defined by

    $(f, g) = \sum_{a \in A} f(a) \overline{g(a)}, \quad f, g \in L(A),$  (18)

where $\overline{g(a)}$ denotes the complex conjugate of $g(a)$, the evaluation basis is an orthonormal basis
of L(A). Since for $a^*, b^* \in A^*$,

    $(a^*, b^*) = N$ if $a^* = b^*$, and $0$ if $a^* \ne b^*$,

the set

    $\frac{1}{\sqrt{N}} A^*$

is an orthonormal basis of L(A).
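
As a small illustration of the operators in Eqs. (15) and (16), the following Python sketch implements periodization and decimation for A = Z/N, with a function f stored as a list of its N values. The function names are hypothetical.

```python
def periodize(f, B):
    """Per_B f(a) = sum of f(a + b) over b in B, for f on Z/N given as a list."""
    N = len(f)
    return [sum(f[(a + b) % N] for b in B) for a in range(N)]


def decimate(f, B):
    """Dec_B f(a) = f(a) if a is in B, and 0 otherwise."""
    members = set(B)
    return [f[a] if a in members else 0 for a in range(len(f))]
```

With f(a) = a on Z/12 and B = {0, 4, 8}, `periodize` returns a list that repeats with period 4, and `decimate` keeps only the values at 0, 4 and 8.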
Canonical isomorphism

The evaluation basis A and the character basis A* are canonical in the sense that they depend
solely on group structures and not on presentation. Although the groups A and A* are isomorphic,
there is no canonical isomorphism. Duality is defined relative to a particular choice
of isomorphism from A onto A*. By extension, the groups A and A**, the dual of A*, are also
isomorphic, and in fact a canonical isomorphism can be defined. The canonical isomorphism,
as we will see in chapter III, defines the FT of A.

For $a \in A$, the mapping $\Theta(a)$ of A*

    $\Theta(a)(a^*) = \langle a, a^* \rangle, \quad a^* \in A^*,$  (19)

is a character of A*. The mapping

    $\Theta : A \to A^{**}$  (20)

is a canonical isomorphism, since it is defined without reference to presentation.

Consider the evaluation basis A of L(A) and the character basis A** of L(A*). The canonical
isomorphism $\Theta$ of A onto A** defines a linear isomorphism $L(\Theta)$ from L(A) onto L(A*).

E.2.3  Point  group 

Denote the automorphism group of a finite abelian group A by Aut(A). Subgroups of Aut(A)
are called point groups.

For a point group H and a point $a \in A$, the isotropy subgroup $H_a$ of a in H is defined by

    $H_a = \{\alpha \in H : \alpha(a) = a\}.$  (21)

$H_a$ is a subgroup of H. A point $a \in A$ is called a fixed point of H if $H = H_a$. The H-orbit of
a, denoted by H(a), is defined by

    $H(a) = \{\alpha(a) : \alpha \in H\}.$  (22)

The mapping

    $\alpha \mapsto \alpha(a) : H \to A$  (23)

induces a bijection from the space of right cosets $\alpha H_a$, $\alpha \in H$, onto H(a).

Fix a group isomorphism $\phi : A \to A^*$. For $\alpha \in Aut(A)$, define the adjoint $\alpha^\# \in Aut(A)$ by

    $\langle a, \phi(\alpha^\#(c)) \rangle = \langle \alpha(a), \phi(c) \rangle, \quad a, c \in A.$  (24)

Set $\alpha^* = (\alpha^\#)^{-1}$ and observe that

    $(\alpha\beta)^* = \alpha^* \beta^*.$

For a point group H, define

    $H^* = \{\alpha^* : \alpha \in H\}.$

The H-orbit H(B) of a subgroup B of A is the collection of subgroups

    $H(B) = \{\alpha(B) : \alpha \in H\}.$  (25)

Under duality

    $H^*(B^\perp) = (H(B))^\perp.$  (26)

A collection $\mathcal{B}$ of subgroups of A is called H-invariant if

    $h(B) \in \mathcal{B}, \quad h \in H, \ B \in \mathcal{B}.$

If $\mathcal{B}$ is H-invariant, the action of H partitions $\mathcal{B}$ into disjoint H-orbits. Define a complete
system of H-orbit representatives in $\mathcal{B}$ as any collection of subgroups in $\mathcal{B}$

    $B_1, \ldots, B_R$

such that $\mathcal{B}$ is the disjoint union of the collection of H-orbits

    $H(B_1), \ldots, H(B_R).$

A covering of A is a collection of subgroups $\mathcal{B}$ of A such that

    $A = \bigcup_{B \in \mathcal{B}} B.$

Set

    $\mathcal{B}^\perp = \{B^\perp : B \in \mathcal{B}\}.$

We say that $\mathcal{B}$ is a dual covering of A if $\mathcal{B}^\perp$ is a covering of A. We can always construct an
H-invariant covering $\mathcal{B}$ of A.


E.2.4  Affine group

The affine group of A,

    $Aff(A) = A \rtimes Aut(A),$  (27)

is the set of all $(a, \alpha)$, $a \in A$, $\alpha \in Aut(A)$, with group composition

    $(a, \alpha)(a', \alpha') = (a + \alpha(a'), \alpha\alpha').$  (28)

Aff(A) acts on A by

    $(a, \alpha)(c) = a + \alpha(c), \quad a, c \in A, \ \alpha \in Aut(A).$  (29)

For $x \in Aff(A)$, we write $x = (a_x, \alpha_x)$, $a_x \in A$, $\alpha_x \in Aut(A)$.

We define two actions of Aff(A) on L(A). For $f \in L(A)$ and $x \in Aff(A)$, define

    $xf(a) = f(x(a)), \quad a \in A.$  (30)

    $x^\# f(a) = \langle a_x, \phi(\alpha_x^\# a) \rangle f(\alpha_x^\# a), \quad a \in A.$  (31)

We say that f is x-invariant if $xf = f$ and $x^\#$-invariant if $x^\# f = f$.

Choose a subgroup X of Aff(A). An $f \in L(A)$ is X-invariant if f is x-invariant for all
$x \in X$, and $X^\#$-invariant if f is $x^\#$-invariant for all $x \in X$.

The point group $\overline{X}$ of X is defined by

    $\overline{X} = \{\alpha_x : x \in X\}.$  (32)

$\overline{X}$ is a subgroup of Aut(A), but in general is not contained in X.
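
The composition law (28) and the action (29) can be checked numerically. The sketch below represents an element of Aff(A) as a pair (translation, automorphism) for A = Z/12 x Z/12 x Z/12, with automorphisms stored as Python functions. The particular automorphism `alpha` and all function names are assumptions made for this illustration.

```python
N = 12  # A = Z/12 x Z/12 x Z/12


def add(a, b):
    """Componentwise addition in A."""
    return tuple((x + y) % N for x, y in zip(a, b))


def alpha(c):
    """An automorphism of A of order 6: (c1, c2, c3) -> (c1 - c2, c1, c3)."""
    c1, c2, c3 = c
    return ((c1 - c2) % N, c1 % N, c3 % N)


def act(x, c):
    """Action (29): (a, alpha)(c) = a + alpha(c)."""
    a, auto = x
    return add(a, auto(c))


def compose(x, y):
    """Composition (28): (a, alpha)(a', alpha') = (a + alpha(a'), alpha alpha')."""
    a, auto_x = x
    b, auto_y = y
    return (add(a, auto_x(b)), lambda c: auto_x(auto_y(c)))
```

With `x = ((0, 0, 2), alpha)`, acting with `compose(x, x)` agrees with acting by `x` twice, and the sixth power of `x` acts as the identity.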


E.2.5  Examples 
Example E.1  $P6_1$

Crystallographic group $P6_1$ [13] is generated by

    $x = ((0, 0, M), \alpha)$

acting on $Z/3N \times Z/3N \times Z/6M$ for natural numbers N and M,

    $x(a_1, a_2, a_3) = (a_1 - a_2, a_1, a_3 + M).$

Throughout the rest of this example, we will set

    $A = Z/12 \times Z/12 \times Z/12.$

For $(a_1, a_2, a_3) \in A$,

    $x(a_1, a_2, a_3) = (a_1 - a_2, a_1, a_3 + 2),$
    $x^2(a_1, a_2, a_3) = (-a_2, a_1 - a_2, a_3 + 4),$
    $x^3(a_1, a_2, a_3) = (-a_1, -a_2, a_3 + 6),$
    $x^4(a_1, a_2, a_3) = (a_2 - a_1, -a_1, a_3 + 8),$
    $x^5(a_1, a_2, a_3) = (a_2, a_2 - a_1, a_3 + 10),$
    $x^6(a_1, a_2, a_3) = (a_1, a_2, a_3).$

$P6_1$ acting on A decomposes A into distinct $P6_1$-orbits each of order 6.

The point group of $P6_1$ is also a crystallographic group, denoted by P6 [13]. P6 is generated by $\alpha$,

    $\alpha(a_1, a_2, a_3) = (a_1 - a_2, a_1, a_3).$

P6-orbits also decompose A into distinct orbits. A P6-orbit may have 1, 2, 3 or 6 elements.

    $P6(0, 0, a_3) = \{(0, 0, a_3)\}, \quad 0 \le a_3 \le 11,$

and $(0, 0, a_3)$ are fixed points of P6.

    $P6(4, 8, a_3) = \{(4, 8, a_3), (8, 4, a_3)\}, \quad 0 \le a_3 \le 11.$

The isotropy subgroup of $(4, 8, a_3)$ is generated by $\alpha^2$.

    $P6(6, 6, a_3) = \{(6, 6, a_3), (0, 6, a_3), (6, 0, a_3)\}, \quad 0 \le a_3 \le 11.$

The isotropy subgroup of $(6, 6, a_3)$ is generated by $\alpha^3$.

The non-trivial isotropy subgroups, $\{1, \alpha^2, \alpha^4\}$ and $\{1, \alpha^3\}$, where 1 denotes the identity
automorphism, are again crystallographic groups denoted by P3 and P2 [13], respectively.
With respect to $\phi$ defined in Eq. (11),

    $\langle \alpha^{-1}(a_1, a_2, a_3), \phi(b_1, b_2, b_3) \rangle = \langle (a_2, a_2 - a_1, a_3), \phi(b_1, b_2, b_3) \rangle$
        $= \langle (a_1, a_2, a_3), \phi(-b_2, b_1 + b_2, b_3) \rangle$
        $= \langle (a_1, a_2, a_3), \phi(\alpha^*(b_1, b_2, b_3)) \rangle,$

and $\alpha^*(b_1, b_2, b_3) = (-b_2, b_1 + b_2, b_3)$.
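
The orbit structure of P6 described in this example can be enumerated by brute force. The following Python sketch, with hypothetical function names, generates the orbits of the cyclic group generated by $\alpha(a_1, a_2, a_3) = (a_1 - a_2, a_1, a_3)$ on Z/12 x Z/12 x Z/12 and counts them by size.

```python
from collections import Counter
from itertools import product

N = 12


def alpha(a):
    """Generator of P6 acting on Z/12 x Z/12 x Z/12."""
    a1, a2, a3 = a
    return ((a1 - a2) % N, a1, a3)


def orbit(a):
    """Orbit of a under the cyclic group generated by alpha."""
    points = [a]
    b = alpha(a)
    while b != a:
        points.append(b)
        b = alpha(b)
    return frozenset(points)


def orbit_size_counts():
    """Count the distinct orbits in A by their number of elements."""
    orbits = {orbit(a) for a in product(range(N), repeat=3)}
    return Counter(len(o) for o in orbits)
```

Only orbits with 1, 2, 3 or 6 elements occur, in agreement with the statement above; the orbits of size 1, 2 and 3 are exactly those listed, one for each value of $a_3$.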

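Example E.2  P6/mmm

Crystallographic group P6/mmm is isomorphic to the abstract group

    $Z/6 \rtimes (Z/2 \times Z/2).$

We will describe the group by listing the 3 generators.

    $\alpha(a_1, a_2, a_3) = (a_1 - a_2, a_1, a_3), \quad \alpha^6 = 1,$
    $\beta(a_1, a_2, a_3) = (a_2, a_1, -a_3), \quad \beta^2 = 1,$
    $\gamma(a_1, a_2, a_3) = (a_1, a_2, -a_3), \quad \gamma^2 = 1.$

This is a nonabelian group, and we have the following commuting relations:

    $\beta\alpha = \alpha^{-1}\beta, \quad \gamma\alpha = \alpha\gamma, \quad \gamma\beta = \beta\gamma.$

Set $A = Z/12 \times Z/12 \times Z/12$. We will consider isotropy subgroups of elements.

    $P6/mmm(4, 8, 6) = \{(4, 8, 6), (8, 4, 6)\},$

    $P6/mmm_{(4,8,6)} = \{\alpha^{2j}, \ \alpha^{2j}\gamma, \ \alpha^{2j+1}\beta, \ \alpha^{2j+1}\beta\gamma : j = 0, 1, 2\}.$

For $a \ne 0, 6$,

    $P6/mmm(4, 8, a) = \{(4, 8, a), (8, 4, a), (8, 4, -a), (4, 8, -a)\},$

    $P6/mmm_{(4,8,a)} = \{1, \alpha^2, \alpha^4, \alpha\beta\gamma, \alpha^3\beta\gamma, \alpha^5\beta\gamma\} = P3m1,$

where P3m1 is a crystallographic point group.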


Example  E.3  Pmmm 


Let $A = Z/2N \times Z/2M \times Z/2L$, for natural numbers N, M and L. $Pmmm \le Aut(A)$
is generated by $\rho_1(a_1, a_2, a_3) = (-a_1, a_2, a_3)$, $\rho_2(a_1, a_2, a_3) = (a_1, -a_2, a_3)$, $\rho_3(a_1, a_2, a_3) =
(a_1, a_2, -a_3)$. Each of the generators is of order 2 and Pmmm has 8 elements. With respect to
the isomorphism defined in Eq. (11),

    $\rho_i^* = \rho_i, \quad i = 1, 2, 3.$

The subgroup

    $B = \{(b_1 N, b_2 M, b_3 L) : b_i = 0, 1, \ i = 1, 2, 3\}$  (33)

is the group of fixed points of Pmmm. Let

    $B_1 = \{(b_1, b_2 M, b_3 L) : 0 \le b_1 \le 2N - 1, \ b_2 = 0, 1, \ b_3 = 0, 1\}.$

    $Pmmm_{B_1} = \{1, \rho_2, \rho_3, \rho_2\rho_3\}.$

    $B_2 = \{(b_1 N, b_2, b_3 L) : 0 \le b_2 \le 2M - 1, \ b_1 = 0, 1, \ b_3 = 0, 1\}.$

    $Pmmm_{B_2} = \{1, \rho_1, \rho_3, \rho_1\rho_3\}.$

Example  E.4  Fmmm 

Set $A = Z/2N \times Z/2M \times Z/2L$, for natural numbers N, M and L. The crystallographic
affine group $Fmmm \le Aff(A)$ is

    $B \times Pmmm,$

where $B \le A$ is the fixed subgroup of Pmmm given in Eq. (33). Each of the generators is of
order 2 and Fmmm has 64 elements. An element of Fmmm is of the form

    $(b, \rho_1^{r_1} \rho_2^{r_2} \rho_3^{r_3}), \quad b \in B, \ r_k = 0, 1, \ k = 1, 2, 3.$

We will denote the elements of Fmmm by an ordered 6-tuple of 1's and 0's by listing the values
of $b_j$ and $r_k$ in order, i.e.,

    $(b_1 N, b_2 M, b_3 L, \rho_1^{r_1} \rho_2^{r_2} \rho_3^{r_3}) \leftrightarrow (b_1, b_2, b_3, r_1, r_2, r_3).$

In this notation, the group composition in Fmmm is given by componentwise addition modulo
2 in each of the 6 components. We will also index the elements of Fmmm from 0 to 63 by the
binary expansion of the 6-tuple,

    $(b_1, b_2, b_3, r_1, r_2, r_3) \leftrightarrow b_1 + 2b_2 + 4b_3 + 8r_1 + 16r_2 + 32r_3.$

In this notation

    $B = \{s_0, s_1, s_2, s_3, s_4, s_5, s_6, s_7\}.$

There are no fixed points of Fmmm.

    $Fmmm_B = Pmmm.$

    $\overline{Fmmm} = Pmmm = \{s_0, s_8, s_{16}, s_{24}, s_{32}, s_{40}, s_{48}, s_{56}\}.$
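
The 6-tuple indexing of Fmmm lends itself to a bitwise encoding. In the sketch below, which uses hypothetical function names, an element is stored as its index between 0 and 63, and componentwise addition modulo 2 becomes a bitwise exclusive-or of indices.

```python
def fmmm_index(b1, b2, b3, r1, r2, r3):
    """Index of the 6-tuple (b1, b2, b3, r1, r2, r3) by its binary expansion."""
    return b1 + 2 * b2 + 4 * b3 + 8 * r1 + 16 * r2 + 32 * r3


def fmmm_tuple(s):
    """Recover the 6-tuple from an index between 0 and 63."""
    return tuple((s >> k) & 1 for k in range(6))


def fmmm_compose(s, t):
    """Componentwise addition modulo 2 is the bitwise XOR of the two indices."""
    return s ^ t
```

The translation subgroup B corresponds to the indices 0 through 7, and the elements with zero translation part correspond to the multiples of 8.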
E.3  FT of a finite abelian group

View A as a basis of L(A) and A** as a basis of L(A*). In chapter II, we defined the canonical
isomorphism

    $\Theta : A \to A^{**}$

by

    $\Theta(a)(a^*) = \langle a, a^* \rangle, \quad a \in A, \ a^* \in A^*.$  (34)

The Fourier transform $F_A$ of A is the unique linear extension

    $F_A : L(A) \to L(A^*)$

of $\Theta$. It follows that $F_A$ is a linear isomorphism given by

    $F_A f(a^*) = \sum_{a \in A} f(a) \langle a, a^* \rangle, \quad f \in L(A), \ a^* \in A^*,$  (35)

with inverse given by

    $f = \frac{1}{N} \sum_{a^* \in A^*} F_A f(-a^*)\, a^*, \quad f \in L(A), \ N = o(A).$  (36)

The coefficients of f over the character basis are given by $\frac{1}{N} F_A f(-a^*)$, $a^* \in A^*$.
For an isomorphism $\phi : A \to A^*$, define the FT presentation

    $F_\phi : L(A) \to L(A)$  (37)

by

    $(F_\phi f)(a) = (F_A f)(\phi(a)), \quad f \in L(A), \ a \in A.$  (38)

FT presentations associated to different isomorphisms differ by input permutations. The choice
of $\phi$ can be an important parameter in algorithm design, especially in crystallographic applications
where $\phi$ can be matched to crystallographic symmetry to simplify coding. Throughout
this chapter we fix an isomorphism $\phi : A \to A^*$.

For a subgroup B of A, define the induced Fourier transforms

    $F_{\phi_1} : L(A/B) \to L(B^\perp),$  (39)

    $F_{\phi_2} : L(B) \to L(A/B^\perp),$  (40)

by the formulas

    $(F_{\phi_1} f)(b^\perp) = (F_{A/B} f)(\phi_1(b^\perp)), \quad f \in L(A/B), \ b^\perp \in B^\perp,$  (41)

    $(F_{\phi_2} f)(a + B^\perp) = (F_B f)(\phi_2(a + B^\perp)), \quad f \in L(B), \ a \in A,$  (42)

where $\phi_1$ and $\phi_2$ are defined in Eqs. (13) and (14). $F_{\phi_1}$ and $F_{\phi_2}$ are linear isomorphisms.
We will write $F^B_{\phi_1}$ for $F_{\phi_1}$ and $F^B_{\phi_2}$ for $F_{\phi_2}$ when we want to bring out the dependence on the
subgroup B.

E.3.1  Periodization-Decimation 

Divide and conquer algorithms for computing the action of $F_\phi$ decompose the computation into
a collection of induced FT computations. In this chapter, we will see how the FT interchanges
the fundamental operations of periodization and decimation.

For a subgroup B of A and $f \in L(A)$, $Per_B f \in L(A)$ is B-periodic and we can view

    $Per_B f \in L(A/B).$

Theorem E.3  For $f \in L(A)$, $F_\phi(Per_B f)$ vanishes off of $B^\perp$ and

    $F_\phi f(b^\perp) = F_{\phi_1}(Per_B f)(b^\perp), \quad b^\perp \in B^\perp.$

Theorem E.3 implies we compute $F_\phi f$ on the subgroup $B^\perp$ by computing the induced FT

    $F_{\phi_1}(Per_B f).$

For $f \in L(A)$, we can view $Dec_B f \in L(B)$.

Theorem E.4  For $f \in L(A)$, $F_\phi(Dec_B f)$ is $B^\perp$-periodic and

    $Per_{B^\perp}(F_\phi f)(a) = o(B^\perp)\, F_\phi(Dec_B f)(a)$
        $= o(B^\perp)\, F_{\phi_2}(Dec_B f)(a + B^\perp), \quad a \in A.$

Theorem E.4 computes the periodization of $F_\phi f$ relative to $B^\perp$ by computing the induced FT
$F_{\phi_2}(Dec_B f)$.
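
Theorems E.3 and E.4 can be checked numerically on a small case. The sketch below takes A = Z/12, B = {0, 4, 8} and $B^\perp$ = {0, 3, 6, 9}, and assumes the pairing $e^{2\pi i\, ac/12}$ of Eq. (11). The function names and the choice of coset representatives {0, 1, 2, 3} for A/B are assumptions of this illustration.

```python
import cmath

N = 12
B = [0, 4, 8]
B_PERP = [0, 3, 6, 9]


def pairing(a, c):
    """<a, phi(c)> = exp(2*pi*i*a*c/N)."""
    return cmath.exp(2j * cmath.pi * a * c / N)


def ft(f):
    """F_phi f(c) = sum over a in A of f(a) <a, phi(c)>."""
    return [sum(f[a] * pairing(a, c) for a in range(N)) for c in range(N)]


def periodize(f, sub):
    return [sum(f[(a + b) % N] for b in sub) for a in range(N)]


def decimate(f, sub):
    return [f[a] if a in sub else 0 for a in range(N)]


def induced_ft_quotient(g, c):
    """F_{phi_1} g(c) for a B-periodic g: sum over coset representatives 0..3 of A/B."""
    return sum(g[a] * pairing(a, c) for a in range(N // len(B)))
```

For any test function f, `induced_ft_quotient(periodize(f, B), c)` agrees with `ft(f)[c]` for c in $B^\perp$ (Theorem E.3), and `periodize(ft(f), B_PERP)` agrees with 4 times `ft(decimate(f, B))` (Theorem E.4), up to rounding.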

E.4  FFT  Algorithms 

E.4.1  Introduction 

Algorithms are distinguished by their strategies for decomposing the global computation.
Cooley-Tukey fast FT (CT FFT) algorithms partition the computations into FT of periodizations or
decimations relative to the cosets of some subgroup B of A. Recently formulated reduced
transform (RT) algorithms decompose the computation into FT of periodizations or decimations
relative to a collection of subgroups covering A. Details, including implementation stages
on RISC and massively parallel multiprocessors, can be found in [14] with performance results.

In this chapter, we will briefly outline the structure of the RT, CT FFT and GT algorithms.
Detailed derivations of these algorithms can be found in [2, 19].


E.4.2  RT  algorithm 

RT algorithms decompose the computation of FT into a collection of induced FT taken over
the subgroups of a covering or dual covering of the indexing set. One form of the RT algorithm
begins with a dual covering $\mathcal{B}$ of A and computes $F_\phi f$ by

•  forming the collection of periodizations

    $Per_B f \in L(A/B), \quad B \in \mathcal{B},$

•  computing the collection of induced FT

    $F^B_{\phi_1}(Per_B f), \quad B \in \mathcal{B}.$

This completes the computation since $F^B_{\phi_1}(Per_B f)$ equals $F_\phi f$ on $B^\perp$ and $\mathcal{B}$ is a dual covering
of A.

A dual form RT algorithm begins with a covering $\mathcal{B}$ of A. Define the integer
valued function $\mu$ on A by

    $\mu(a)$ = the number of subgroups in $\mathcal{B}$ containing a.

Define the weighted decimations of f by

    $Dec^\mu_B f(a) = \frac{1}{\mu(a)} f(a)$ if $a \in B$, and $0$ otherwise.

Since $\mathcal{B}$ covers A

    $f = \sum_{B \in \mathcal{B}} Dec^\mu_B f,$  (43)

    $F_\phi f = \sum_{B \in \mathcal{B}} F_\phi Dec^\mu_B f,$  (44)

and we can compute $F_\phi f$ by

•  Forming the collection of decimations

    $Dec^\mu_B f \in L(B), \quad B \in \mathcal{B}.$

•  Computing the collection of induced FT

    $F^B_{\phi_2}(Dec^\mu_B f), \quad B \in \mathcal{B}.$

Redundant computation is a necessary part of RT algorithms. An analysis of the advantages
and disadvantages of RT algorithms can be found in [19]. Typically these algorithms are
targeted to large size MD DFT computations on shared memory multiprocessors but have been
implemented on distributed memory multiprocessors with significant speed-up as compared to
standard CT FFT implementations. The RT algorithm on some machines can be bottlenecked
by the I/O bandwidth required in the initial stage periodizations but offers complete
parallelization (subject to the number of processors and granularity) afterwards and can be easily
scaled to transform size and machine configuration. This should be compared with standard
approaches which interleave communication and computation by global transpositions.

In applications, say, to the M-dimensional FT, the collection $\mathcal{B}$ is usually taken such that
duals are a covering set of K-dimensional (K < M) planes through the origin. The dimension
K is an important design parameter as it affects local granularity and global parallelism.
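
The first form of the RT algorithm can be sketched for the simplest 2D case. The code below assumes A = Z/p x Z/p with p prime, the pairing of Eq. (11), and the covering of A by the p + 1 lines through the origin generated by (1, m), 0 <= m < p, and (0, 1). For each line, f is periodized over the dual subgroup and one p-point FT is taken. The function names are hypothetical and the code is an illustration, not the implementation reported in [14] or [19].

```python
import cmath
from itertools import product


def rt_dft2_prime(f, p):
    """2D FT on Z/p x Z/p (p prime) via periodizations and p + 1 p-point FT.

    f is a dict mapping (a1, a2) to a complex value.
    """
    w = cmath.exp(2j * cmath.pi / p)
    generators = [(1, m) for m in range(p)] + [(0, 1)]  # p + 1 lines covering A
    F = {}
    for g in generators:
        # Periodization: sum f over each coset {a : a1*g1 + a2*g2 = t} of the dual of <g>.
        per = [0] * p
        for a in product(range(p), repeat=2):
            per[(a[0] * g[0] + a[1] * g[1]) % p] += f[a]
        # Induced p-point FT gives F_phi f on the line <g>.
        for k in range(p):
            c = ((k * g[0]) % p, (k * g[1]) % p)
            F[c] = sum(per[t] * w ** (k * t) for t in range(p))
    return F


def direct_dft2(f, p):
    """Direct evaluation of the same transform, for comparison."""
    w = cmath.exp(2j * cmath.pi / p)
    return {
        c: sum(f[a] * w ** (a[0] * c[0] + a[1] * c[1]) for a in product(range(p), repeat=2))
        for c in product(range(p), repeat=2)
    }
```

The value at the origin is computed once per line, which is an instance of the redundant computation mentioned above.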


E.4.3  CT  FFT  algorithm 

Choose a subgroup $B \le A$. One form of the CT FFT begins by subjecting data to generalized
periodizations relative to B. This step can be implemented by a collection of Fourier transform
computations. However, we choose to express this step as a collection of generalized
periodizations to bring out the analogy with the RT algorithm and to clearly distinguish stages requiring
full data access from stages acting, in parallel, on localized data.

Choose a subgroup B of A. For $f \in L(A)$ and $b^* \in B^*$ define $f_{b^*} \in L(A)$ by

    $f_{b^*}(a) = \sum_{b \in B} f(a + b) \langle b, b^* \rangle, \quad a \in A.$  (45)

We call $f_{b^*}$ a generalized periodization since

    $f_{b^*}(a + b) = \overline{\langle b, b^* \rangle}\, f_{b^*}(a), \quad a \in A, \ b \in B.$  (46)

Theorem E.5  For $f \in L(A)$,

    $f = \frac{1}{o(B)} \sum_{b^* \in B^*} f_{b^*},$

    $F_\phi f = \frac{1}{o(B)} \sum_{b^* \in B^*} F_\phi f_{b^*}.$  (47)

It follows that we can compute $F_\phi f$ by computing the collection of FT $F_\phi f_{b^*}$, $b^* \in B^*$.

Consider the group isomorphism $\phi_2 : A/B^\perp \to B^*$. Choose a complete system of $B^\perp$-coset
representatives

    $z(b^*) \in \phi_2^{-1}(b^*), \quad b^* \in B^*.$

Theorem E.6  $F_\phi f_{b^*}$ vanishes off of the $B^\perp$-coset $z(b^*) + B^\perp$, and

    $F_\phi f(z(b^*) + b^\perp) = \frac{1}{o(B)} F_\phi f_{b^*}(z(b^*) + b^\perp), \quad b^\perp \in B^\perp.$

$F_\phi f_{b^*}$ determines $F_\phi f$ on the $B^\perp$-coset $z(b^*) + B^\perp$, $b^* \in B^*$. Since the $B^\perp$-cosets form a disjoint
partition of A, the computations

    $F_\phi f_{b^*}, \quad b^* \in B^*,$

can be implemented in parallel and the second sum in Theorem E.5 requires no computation.
Once the generalized periodizations are computed, the computation can be completed in parallel
by induced FT computations which output $F_\phi f$ on $B^\perp$-cosets. This is accomplished by first
performing a twiddle factor multiplication of generalized periodizations defined as follows.

For $b^* \in B^*$, define $g_{b^*} \in L(A)$ by

    $g_{b^*}(a) = f_{b^*}(a) \langle a, \phi(z(b^*)) \rangle, \quad a \in A.$  (48)

$g_{b^*}$ is B-periodic and can be viewed as a function in L(A/B).

Theorem E.7

    $F_\phi f_{b^*}(z(b^*) + b^\perp) = o(B)\, F_{\phi_1} g_{b^*}(b^\perp), \quad b^\perp \in B^\perp.$

The CT FFT algorithm combines Theorems E.6 and E.7 and computes $F_\phi f$ by independent
computations of $F_\phi f$ on the disjoint $B^\perp$-cosets $z(b^*) + B^\perp$ by the collection of induced smaller
size FT computations

    $F_\phi f(z(b^*) + b^\perp) = F_{\phi_1} g_{b^*}(b^\perp), \quad b^\perp \in B^\perp, \ b^* \in B^*.$  (49)

CT FFT Algorithm

•  Compute the generalized periodizations

    $f_{b^*} \in L(A), \quad b^* \in B^*.$

•  Compute the twiddle factor multiplications

    $g_{b^*} \in L(A/B), \quad b^* \in B^*.$

•  Compute the induced FT

    $F_\phi f(z(b^*) + b^\perp) = F_{\phi_1} g_{b^*}(b^\perp), \quad b^\perp \in B^\perp, \ b^* \in B^*.$
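
The three stages of the CT FFT can be traced on A = Z/12 with B = {0, 4, 8} and $B^\perp$ = {0, 3, 6, 9}. The sketch below assumes the pairing $w^{ac}$, $w = e^{2\pi i/12}$, indexes the characters of B by the coset representatives z = 0, 1, 2 so that $\langle b, b^* \rangle = w^{zb}$, and uses {0, 1, 2, 3} as representatives of A/B. All names are hypothetical, and the code is a direct transcription of the stages rather than an optimized implementation.

```python
import cmath

N = 12
W = cmath.exp(2j * cmath.pi / N)
B = [0, 4, 8]
B_PERP = [0, 3, 6, 9]


def ct_fft_12(f):
    """12-point FT via generalized periodization, twiddle factors, and 4-point FT."""
    F = [0] * N
    for z in range(3):  # z = z(b*), one coset representative per character of B
        # Stage 1: generalized periodization on the coset representatives 0..3 of A/B.
        f_b = [sum(f[a + b] * W ** (z * b) for b in B) for a in range(4)]
        # Stage 2: twiddle factor multiplication <a, phi(z)>.
        g_b = [f_b[a] * W ** (a * z) for a in range(4)]
        # Stage 3: induced 4-point FT gives F_phi f on the coset z + B_perp.
        for c in B_PERP:
            F[(z + c) % N] = sum(g_b[a] * W ** (a * c) for a in range(4))
    return F


def direct_ft_12(f):
    """Direct evaluation of the 12-point FT, for comparison."""
    return [sum(f[a] * W ** (a * c) for a in range(N)) for c in range(N)]
```

Each value of z can be processed independently, which reflects the parallelism over $B^\perp$-cosets described above.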


E.4.4  Good-Thomas  algorithm 

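The GT will be derived as a special case of the CT FFT. In [2, 19], a direct proof was given.
Choose a subgroup $B \le A$. We require that A has a direct product decomposition

    $A = B \times C,$

where C is a subgroup of A. Choose group isomorphisms

    $\phi_B : B \to B^*, \quad \phi_C : C \to C^*.$

The mapping

    $\phi : A \to A^*$

defined by

    $\langle (b', c'), \phi(b, c) \rangle = \langle b', \phi_B(b) \rangle \langle c', \phi_C(c) \rangle, \quad b, b' \in B, \ c, c' \in C,$

is a group isomorphism. Relative to $\phi$

    $C = B^\perp, \quad B = C^\perp.$

Since $A/B = B^\perp$ and $A/B^\perp = B$, $\phi_1 = \phi_{B^\perp}$ and $\phi_2 = \phi_B$. In particular, in the notation of
the previous chapter, we can take

    $z(b^*) = \phi_B^{-1}(b^*), \quad b^* \in B^*,$

which amounts to taking B as a complete system of $B^\perp$-coset representatives in A. Under these
assumptions, the CT FFT takes the form

    $g_{\phi_B(b)}(b^\perp) = f_{\phi_B(b)}(b^\perp), \quad b \in B, \ b^\perp \in B^\perp,$  (50)

    $F_\phi f(b + b^\perp) = F_{\phi_{B^\perp}} g_{\phi_B(b)}(b^\perp), \quad b \in B, \ b^\perp \in B^\perp.$  (51)

•  Compute

    $g_{\phi_B(b)} \in L(B^\perp), \quad b \in B.$

•  Compute

    $F_{\phi_{B^\perp}}(g_{\phi_B(b)}) \in L(B^\perp), \quad b \in B.$

The second stage is a collection of FT computations over $B^\perp$. We will see that the first
stage is a collection of FT computations over B. By definition

    $g_{\phi_B(b)}(b^\perp) = \sum_{b' \in B} f(b^\perp + b') \langle b', \phi_B(b) \rangle,$

which equals

    $F_{\phi_B} f_{b^\perp}(b),$

where

    $f_{b^\perp}(b) = f(b + b^\perp), \quad b \in B, \ b^\perp \in B^\perp.$

The precise statement of the stages of the GT can now be given as follows.

GT algorithm

•  Form the slices

    $f_{b^\perp} \in L(B), \quad b^\perp \in B^\perp.$

•  Compute the collection of FT over B

    $F_{\phi_B} f_{b^\perp} \in L(B), \quad b^\perp \in B^\perp.$

•  Form the functions

    $g_{\phi_B(b)}(b^\perp) = F_{\phi_B} f_{b^\perp}(b).$

This step requires data transpose (or permutation).

•  Compute the collection of FT over $B^\perp$

    $F_{\phi_{B^\perp}} g_{\phi_B(b)} \in L(B^\perp), \quad b \in B.$

•  Set

    $F_\phi f(b + b^\perp) = F_{\phi_{B^\perp}} g_{\phi_B(b)}(b^\perp).$

This step requires data transpose (or permutation).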
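
The GT stages can be illustrated on the ordinary 12-point DFT with kernel $e^{2\pi i\, nc/12}$, using the factors Z/3 and Z/4. The sketch below swaps in the standard prime factor index maps, which are an assumption of this illustration and not taken from the text: the input index is n = 4j' + 3k' mod 12 and the output index is c = 4j + 9k mod 12, where 4 and 9 are the CRT idempotents of Z/12. With these maps no twiddle factors appear, and the computation reduces to 3-point FT followed by 4-point FT with index permutations.

```python
import cmath


def gt_dft_12(f):
    """12-point DFT by a Good-Thomas style 3 x 4 decomposition (no twiddle factors)."""
    w3 = cmath.exp(2j * cmath.pi / 3)
    w4 = cmath.exp(2j * cmath.pi / 4)
    # Stage 1: 3-point FT over the slices indexed by k' (input index 4*j' + 3*k' mod 12).
    g = [
        [sum(f[(4 * jp + 3 * kp) % 12] * w3 ** (j * jp) for jp in range(3)) for kp in range(4)]
        for j in range(3)
    ]
    # Stage 2: 4-point FT for each j; the output is stored at the CRT index 4*j + 9*k mod 12.
    F = [0] * 12
    for j in range(3):
        for k in range(4):
            F[(4 * j + 9 * k) % 12] = sum(g[j][kp] * w4 ** (k * kp) for kp in range(4))
    return F


def direct_dft_12(f):
    """Direct 12-point DFT with kernel exp(2*pi*i*n*c/12), for comparison."""
    w = cmath.exp(2j * cmath.pi / 12)
    return [sum(f[n] * w ** (n * c) for n in range(12)) for c in range(12)]
```

The two index maps play the role of the data permutations required in the third and fifth steps of the GT algorithm.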


E.5  Examples  and  implementations 


For applications to X-ray crystallography, we will take a 3D case to illustrate the theory
presented here. In particular, the smallest non-trivial case, $Z/12 \times Z/12 \times Z/12$, is used in many
of the examples, while $Z/3N \times Z/3N \times Z/6M$ and $Z/2N_1 \times Z/2N_2 \times Z/2N_3$ are used in the
implementation for several natural numbers.

In all the examples, we will take the fixed isomorphism $\phi$ given in Eq. (11). To simplify
notation, especially in presenting covering subgroups, we will use the following definition and
notation.

Let A be a finite abelian group. For $a \in A$ denote by $\langle a \rangle$ the subgroup of A generated
by a,

    $\langle a \rangle = \{0, a, 2a, 3a, \ldots, (K - 1)a\},$

where K is the smallest positive integer such that $Ka = 0 \in A$. K is called the order of a.


E.5.1  RT  algorithm 

Two forms of RT algorithm will be derived for $A = Z/3 \times Z/3 \times Z/3$. Using CRT, we will
extend our current example to groups of the form

    $Z/3 \cdot 2^N \times Z/3 \cdot 2^N \times Z/6M$

for integers N and M.

Example E.5  RT algorithm I for $A = Z/3 \times Z/3 \times Z/3$

Set $A = Z/3 \times Z/3 \times Z/3$. The following 4 subgroups cover A.

    $B_1^\perp = \langle (0, 1) \rangle \times Z/3, \quad B_2^\perp = \langle (1, 1) \rangle \times Z/3,$
    $B_3^\perp = \langle (2, 1) \rangle \times Z/3, \quad B_4^\perp = \langle (1, 0) \rangle \times Z/3.$

$\mu(a_1, a_2, a_3) = 1$ for all $(a_1, a_2, a_3) \in A$, except $\mu(0, 0, a_3) = 4$. With respect to the isomorphism
defined in Eq. (11), we have

    $B_1 = \langle (1, 0, 0) \rangle, \quad B_2 = \langle (1, 2, 0) \rangle,$
    $B_3 = \langle (1, 1, 0) \rangle, \quad B_4 = \langle (0, 1, 0) \rangle.$

To index the periodizations, we will fix the coset representatives of $A/B_r$, $1 \le r \le 3$, and $A/B_4$
as follows.

    $A/B_r$ : (0,0,0), (0,1,0), (0,2,0),
              (0,0,1), (0,1,1), (0,2,1),
              (0,0,2), (0,1,2), (0,2,2),  r = 1, 2, 3.

    $A/B_4$ : (0,0,0), (1,0,0), (2,0,0),
              (0,0,1), (1,0,1), (2,0,1),
              (0,0,2), (1,0,2), (2,0,2).

For $c_1, c_2 = 0, 1, 2$,

    $f_1(0, c_1, c_2) = \sum_{b=0}^{2} f(b, c_1, c_2), \quad f_2(0, c_1, c_2) = \sum_{b=0}^{2} f(b, 2b + c_1, c_2),$

    $f_3(0, c_1, c_2) = \sum_{b=0}^{2} f(b, b + c_1, c_2), \quad f_4(c_1, 0, c_2) = \sum_{b=0}^{2} f(c_1, b, c_2).$

The collection of induced FT computations is implemented by the 4 independent 2D 3 x 3
Fourier transforms.

    $F_{\phi_1} f_1(0, a_1, a_2) = \sum_{c_2=0}^{2} \sum_{c_1=0}^{2} f_1(0, c_1, c_2)\, e^{\frac{2\pi i}{3}(a_1 c_1 + a_2 c_2)},$

    $F_{\phi_1} f_2(a_1, a_1, a_2) = \sum_{c_2=0}^{2} \sum_{c_1=0}^{2} f_2(0, c_1, c_2)\, e^{\frac{2\pi i}{3}(a_1 c_1 + a_2 c_2)},$

    $F_{\phi_1} f_3(2a_1, a_1, a_2) = \sum_{c_2=0}^{2} \sum_{c_1=0}^{2} f_3(0, c_1, c_2)\, e^{\frac{2\pi i}{3}(a_1 c_1 + a_2 c_2)},$

    $F_{\phi_1} f_4(a_1, 0, a_2) = \sum_{c_2=0}^{2} \sum_{c_1=0}^{2} f_4(c_1, 0, c_2)\, e^{\frac{2\pi i}{3}(a_1 c_1 + a_2 c_2)}.$
Example E.6  RT algorithm II for $A = Z/3 \times Z/3 \times Z/3$

We list a collection of 13 covering subgroups along with their dual groups. Each of the covering
subgroups is of order 3, while the dual group is a subgroup of order 9. For $a = 0, 1, 2$ and
$b_1, b_2 = 0, 1, 2$,

    $D_1^\perp = \{(a, 0, 0)\}$          $D_1 = \{(0, b_1, b_2)\}$
    $D_2^\perp = \{(0, a, 0)\}$          $D_2 = \{(b_1, 0, b_2)\}$
    $D_3^\perp = \{(a, a, 0)\}$          $D_3 = \{(b_1, 2b_1, b_2)\}$
    $D_4^\perp = \{(2a, a, 0)\}$         $D_4 = \{(b_1, b_1, b_2)\}$
    $D_5^\perp = \{(2a, 2a, a)\}$        $D_5 = \{(b_1, b_2, b_1 + b_2)\}$
    $D_6^\perp = \{(0, a, a)\}$          $D_6 = \{(b_1, b_2, 2b_2)\}$
    $D_7^\perp = \{(0, 2a, a)\}$         $D_7 = \{(b_1, b_2, b_2)\}$
    $D_8^\perp = \{(0, 0, a)\}$          $D_8 = \{(b_1, b_2, 0)\}$
    $D_9^\perp = \{(a, 0, a)\}$          $D_9 = \{(b_1, b_2, 2b_1)\}$
    $D_{10}^\perp = \{(2a, 0, a)\}$      $D_{10} = \{(b_1, b_2, b_1)\}$
    $D_{11}^\perp = \{(a, a, a)\}$       $D_{11} = \{(b_1, b_2, 2b_1 + 2b_2)\}$
    $D_{12}^\perp = \{(2a, a, a)\}$      $D_{12} = \{(b_1, b_2, b_1 + 2b_2)\}$
    $D_{13}^\perp = \{(2a, a, 2a)\}$     $D_{13} = \{(b_1, b_2, 2b_1 + b_2)\}.$

$\mu(a_1, a_2, a_3) = 1$ for all $(a_1, a_2, a_3) \in A$, except $\mu(0, 0, 0) = 13$. We will show 2 of the
computations explicitly. The rest follows in exactly the same way. To index the periodizations
with respect to $D_r$, set

    $A/D_3 : \{(0, 0, 0), (1, 0, 0), (2, 0, 0)\},$  (52)

    $A/D_5 : \{(0, 0, 0), (0, 0, 1), (0, 0, 2)\}.$  (53)

Usually, coset representatives are not unique. Note that although the collection in Eq. (52)
can be used as $A/D_5$ as well as $A/D_3$, Eq. (53) cannot be used for $A/D_3$. For $a, c = 0, 1, 2$,

    $f_3(c, 0, 0) = Per_{D_3} f(c, 0, 0) = \sum_{b_2=0}^{2} \sum_{b_1=0}^{2} f(b_1 + c, 2b_1, b_2),$

    $f_5(0, 0, c) = Per_{D_5} f(0, 0, c) = \sum_{b_2=0}^{2} \sum_{b_1=0}^{2} f(b_1, b_2, b_1 + b_2 + c).$

    $F_{\phi_1} f_3(a, a, 0) = \sum_{c=0}^{2} f_3(c, 0, 0)\, e^{\frac{2\pi i}{3} ac},$

    $F_{\phi_1} f_5(2a, 2a, a) = \sum_{c=0}^{2} f_5(0, 0, c)\, e^{\frac{2\pi i}{3} ac}.$

Remaining cases follow in the same way, and the induced FT computations are implemented
by 13 independent 3-point FT.

The above two derivations show uniform decomposition of a 3D problem into 2D and 1D
problems, respectively. However, the above two cases can be combined to provide various
decompositions.

Example E.7  RT algorithm for $A = Z/2^N \times Z/2^N$

We will list a collection of covering subgroups of A and their dual subgroups of order $2^N$ by
listing their generators.

A is covered by the following $2^N + 2^{N-1}$ subgroups.

Table E.1  Covering subgroups of $Z/2^N \times Z/2^N$

    range                     subgroup        generator    dual group generator
    $0 \le j < 2^N$           $M_j$           (j, 1)       (-1, j)
    $0 \le l < 2^{N-1}$       $M_{2^N + l}$   (1, 2l)      (-2l, 1)

To organize the periodizations, we will set

    $A/\langle (1, j) \rangle : \langle (0, 1) \rangle, \quad 0 \le j < 2^N,$

    $A/\langle (2l, 1) \rangle : \langle (1, 0) \rangle, \quad 0 \le l < 2^{N-1}.$

For $0 \le c < 2^N$,

    $Per_{\langle (1, j) \rangle} f(0, c) = \sum_{b=0}^{2^N - 1} f(b, c + bj), \quad 0 \le j < 2^N,$

    $Per_{\langle (2l, 1) \rangle} f(c, 0) = \sum_{b=0}^{2^N - 1} f(c + 2lb, b), \quad 0 \le l < 2^{N-1}.$

The collection of induced FT is implemented by $2^N + 2^{N-1}$ independent $2^N$-point FT
computations.

For the dual RT algorithm, we list the values of the function $\mu$ on A with respect to the
collection of covering subgroups given in Table E.1.

Denote by $U_0$ the multiplicative units of $Z/2^N$, i.e.,

    $U_0 = \{a \in Z/2^N : a \equiv 1 \bmod 2\}.$

For $1 \le n \le N - 1$, set

    $U_n = \{a \in Z/2^N : GCD(a, 2^N) = 2^n\}.$

Then

    $Z/2^N = \{0\} \cup \bigcup_{n=0}^{N-1} U_n.$

For $a_n \in U_n$, $0 \le n \le N - 1$,

    $\mu(a_n j, a_n) = 2^n, \quad 0 \le j < 2^N,$

    $\mu(a_n, 2 a_n l) = 2^n, \quad 0 \le l < 2^{N-1},$

    $\mu(0, 0) = 2^N + 2^{N-1}.$

Let $\mathcal{B}$ be the collection of covering subgroups of $Z/2^N \times Z/2^N$ given in Table E.1. For $B \in \mathcal{B}$,
compute

    $Dec^\mu_B f.$

To index the induced FT computations, we will fix $A/B^\perp$-coset representatives,

    $A/\langle (-1, j) \rangle : \langle (0, 1) \rangle, \quad 0 \le j \le 2^N - 1,$

    $A/\langle (-2l, 1) \rangle : \langle (1, 0) \rangle, \quad 0 \le l \le 2^{N-1} - 1.$

The collection of induced FT computation is implemented by $2^N + 2^{N-1}$ independent $2^N$-point
FT. To complete the computation of $F_\phi f$, we use the periodicity

    $F_\phi(Dec^\mu_B f)(a + b^\perp) = F_\phi(Dec^\mu_B f)(a), \quad b^\perp \in B^\perp,$

and the formula

    $F_\phi f = \sum_{B \in \mathcal{B}} F_\phi(Dec^\mu_B f).$
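
The covering of Table E.1 and the multiplicity function $\mu$ can be generated by brute force for small N. The sketch below, with hypothetical function names, builds the cyclic subgroups generated by (j, 1) and (1, 2l) in $Z/2^N \times Z/2^N$ and counts how many of them contain a given point.

```python
def generated(g, n):
    """Cyclic subgroup of Z/n x Z/n generated by g."""
    return {((t * g[0]) % n, (t * g[1]) % n) for t in range(n)}


def covering(N):
    """The 2^N + 2^(N-1) subgroups of Table E.1, as sets of points."""
    n = 2 ** N
    gens = [(j, 1) for j in range(n)] + [(1, 2 * l) for l in range(n // 2)]
    return [generated(g, n) for g in gens]


def mu(a, subgroups):
    """Number of subgroups in the collection that contain the point a."""
    return sum(a in sub for sub in subgroups)
```

For N = 2 this gives 6 subgroups whose union is all of Z/4 x Z/4, with `mu((0, 0), covering(2))` equal to 6 and `mu((0, 2), covering(2))` equal to 2.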

Example  E.8  Hybrid  RT/GT  algorithm 

Set $A = Z/3 \cdot 2^N \times Z/3 \cdot 2^N$ for a natural number N. By the fundamental theorem,

    $A \cong A_1 \times A_2,$  (54)

where $A_1 = Z/2^N \times Z/2^N$ and $A_2 = Z/3 \times Z/3$. The subgroup

    $B = \{(a_1 e_1, a_2 e_1) \in A : 0 \le a_1, a_2 < 2^N\}$

is isomorphic to $A_1$, while

    $B^\perp = \{(n_1 e_2, n_2 e_2) \in A : 0 \le n_1, n_2 \le 3 - 1\}$

is isomorphic to $A_2$, where $e_1$ and $e_2$ are the idempotents associated with the isomorphism in
Eq. (54). We have

    $A = B \times B^\perp.$

Using GT algorithm, we can compute $F_A$ by computing $F_{A_1}$ followed by $F_{A_2}$. The induced FT
computations $F_{A_1}$ and $F_{A_2}$ are implemented by RT algorithm.

Example E.9  Covering subgroup computation via CRT

Covering subgroups and their dual subgroups for $A_2$ are given in the following table.

Table E.2  Covering subgroups of $Z/3 \times Z/3$

    k    subgroup    generator    dual group generator
    0    $L_0$       (0,1)        (1,0)
    1    $L_1$       (1,1)        (2,1)
    2    $L_2$       (2,1)        (1,1)
    3    $L_3$       (1,0)        (0,1)

$A_1 \times A_2$ is covered by

    $\{A_1 \times L_k : 0 \le k \le 3\},$

while dual subgroups are given by

    $\{(0, 0) \times L_k^\perp : 0 \le k \le 3\}.$

We can also decompose $A_1$ into covering subgroups. To see this, let N = 2.

Table E.3  Covering subgroups of $A_1 = Z/4 \times Z/4$

    j    subgroup    generator    dual group generator
    0    $M_0$       (0,1)        (1,0)
    1    $M_1$       (1,1)        (3,1)
    2    $M_2$       (2,1)        (1,2)
    3    $M_3$       (3,1)        (1,1)
    4    $M_4$       (1,0)        (0,1)
    5    $M_5$       (1,2)        (2,1)

The idempotents in this case are $e_1 = 9$, $e_2 = 4$ and the collection

    $B_{j,k} = 9 M_j + 4 L_k, \quad 0 \le j \le 5, \ 0 \le k \le 3,$

of 24 subgroups covers A. Each subgroup has order 12; the subgroups are listed in Table E.4.
E.5.2  CT FFT algorithm

Example E.10  CT algorithm for Z/12

Set $w = e^{2\pi i/12}$. For $f \in L(Z/12)$,

    $F_\phi f(c) = \sum_{a=0}^{11} f(a)\, w^{ac}.$

For $B = \{0, 4, 8\}$, $B^\perp = \{0, 3, 6, 9\}$, relative to $\phi$ defined in Eq. (11). Generalized
periodization of f gives rise to 3 functions

    $f_{0^*}(a) = f(a) + f(a + 4) + f(a + 8),$

    $f_{4^*}(a) = f(a) + w^4 f(a + 4) + w^8 f(a + 8),$

    $f_{8^*}(a) = f(a) + w^8 f(a + 4) + w^4 f(a + 8), \quad a \in Z/12.$
Table E.4  Covering subgroups of $Z/12 \times Z/12$

    (j,k)    subgroup     generator    dual group generator
    (0,0)    $B_{0,0}$    (0,1)        (1,0)
    (1,0)    $B_{1,0}$    (9,1)        (7,9)
    (2,0)    $B_{2,0}$    (6,1)        (1,6)
    (3,0)    $B_{3,0}$    (3,1)        (1,9)
    (4,0)    $B_{4,0}$    (9,4)        (4,9)
    (5,0)    $B_{5,0}$    (9,10)       (10,9)
    (0,1)    $B_{0,1}$    (4,1)        (5,4)
    (1,1)    $B_{1,1}$    (1,1)        (11,1)
    (2,1)    $B_{2,1}$    (10,1)       (5,10)
    (3,1)    $B_{3,1}$    (7,1)        (5,1)
    (4,1)    $B_{4,1}$    (1,4)        (8,1)
    (5,1)    $B_{5,1}$    (1,10)       (2,1)
    (0,2)    $B_{0,2}$    (8,1)        (1,4)
    (1,2)    $B_{1,2}$    (5,1)        (7,1)
    (2,2)    $B_{2,2}$    (2,1)        (1,10)
    (3,2)    $B_{3,2}$    (11,1)       (1,1)
    (4,2)    $B_{4,2}$    (5,4)        (4,1)
    (5,2)    $B_{5,2}$    (5,10)       (10,1)
    (0,3)    $B_{0,3}$    (4,9)        (9,4)
    (1,3)    $B_{1,3}$    (1,9)        (3,1)
    (2,3)    $B_{2,3}$    (10,9)       (9,10)
    (3,3)    $B_{3,3}$    (7,9)        (9,1)
    (4,3)    $B_{4,3}$    (1,0)        (0,1)
    (5,3)    $B_{5,3}$    (1,6)        (6,1)

By  Eq.  (46),  fb‘{a)  needs  to  be  computed  only  on  a  set  of  S-coset  representatives,  say, 
{0,  1,  2,  3}.  Thus  the  periodization  is  usually  implemented  by  4  independent  3-point  Fourier 
transform  of  the  strided  values  of  /. 

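Choosing

$$z(0^*) = 0, \quad z(4^*) = 1, \quad z(8^*) = 2,$$

$$g_{0^*}(a) = f_{0^*}(a),$$

$$g_{4^*}(a) = f_{4^*}(a)\,\langle a, \phi(1)\rangle = f_{4^*}(a)\, w^{a},$$

$$g_{8^*}(a) = f_{8^*}(a)\,\langle a, \phi(2)\rangle = f_{8^*}(a)\, w^{2a}, \quad a \in Z/12.$$

$\langle a, \phi(z(b^*))\rangle$ is the so-called twiddle factor.

The quotient group $A/B$ contains 4 elements, $B$, $1 + B$, $2 + B$ and $3 + B$. Via the homomorphism $\phi_1$ and the $B$-periodicity of $g_{b^*}$, we have

$$F_\phi f(z(b^*) + b^\perp) = F_{\phi_1} g_{b^*}(b^\perp)$$

$$= \sum_{a=0}^{3} g_{b^*}(a + B)\,\langle a + B, \phi_1(b^\perp)\rangle$$

$$= \sum_{a=0}^{3} g_{b^*}(a)\,\langle a, \phi_1(b^\perp)\rangle.$$

Since $b^\perp = 3b$ for some $b \in A$ and $w^3 = e^{2\pi i/4}$, the computation of $F_\phi f$ is completed by the 3 independent 4-point Fourier transforms of $g_{b^*}$, $b^* \in B^*$.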
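The three stages of Example E.10 (generalized periodization, twiddle multiplication, 4-point transforms) can be sketched numerically. This is a minimal illustration, not the authors' code. It assumes the sign convention $w = e^{2\pi i/12}$ and the pairing $\langle a, \phi(c)\rangle = w^{ac}$; the names `dft` and `ct_fft_z12` are hypothetical.

```python
import cmath

N = 12
W = cmath.exp(2j * cmath.pi / N)  # assumed convention: w = exp(2*pi*i/12)


def dft(values):
    """Direct n-point FT with kernel exp(+2*pi*i*a*c/n)."""
    n = len(values)
    root = cmath.exp(2j * cmath.pi / n)
    return [sum(values[a] * root ** (a * c) for a in range(n)) for c in range(n)]


def ct_fft_z12(f):
    """CT FFT on Z/12 relative to B = {0, 4, 8}."""
    out = [0j] * N
    for z in range(3):  # z = z(b*) for b* = 0*, 4*, 8*
        # Generalized periodization on the B-coset representatives {0, 1, 2, 3}.
        per = [sum(f[a + 4 * k] * W ** (4 * k * z) for k in range(3))
               for a in range(4)]
        # Twiddle factor <a, phi(z)> = w**(a*z).
        g = [per[a] * W ** (a * z) for a in range(4)]
        # 4-point FT; output index m corresponds to b_perp = 3*m.
        G = dft(g)
        for m in range(4):
            out[(z + 3 * m) % N] = G[m]
    return out


f = [complex(a * a % 7, a % 5) for a in range(N)]
direct = dft(f)
fast = ct_fft_z12(f)
max_err = max(abs(x - y) for x, y in zip(direct, fast))
```

Each of the three values of `z` produces the FT on one coset $z + B^\perp$, so the three 4-point transforms are independent and could run in parallel.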

Example E.11 Multidimensional CT FFT

$$A = Z/2N_1 \times Z/2N_2 \times Z/2N_3.$$

$$B = \{(0,0,0), (N_1,0,0), (0,N_2,0), (N_1,N_2,0), (0,0,N_3), (N_1,0,N_3), (0,N_2,N_3), (N_1,N_2,N_3)\} \quad (55)$$

$$= \{(b_1N_1, b_2N_2, b_3N_3) : b_n = 0 \text{ or } 1,\ n = 1,2,3\}.$$

Label the elements of $B$ by $b_k$, $0 \le k \le 7$, in the order given above.

Table  E.5  Values  on  B  of  characters  of  A. 
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•  Note  that  the  matrix  of  values  of  the  characters  in  table  E.5  is 


$$F(2) \otimes F(2) \otimes F(2),$$

where $\otimes$ denotes the matrix tensor product and $F(2)$ denotes the 2-point FT matrix

$$F(2) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.$$
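The tensor-product remark can be checked with a short sketch. It assumes the labelling $b_k = (b_1N_1, b_2N_2, b_3N_3)$ with $k = b_1 + 2b_2 + 4b_3$, and character values $\langle b_k, \phi(z)\rangle = (-1)^{b_1z_1 + b_2z_2 + b_3z_3}$ for $z \in \{0,1\}^3$; the helper names are hypothetical.

```python
def kron(a, b):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    rows = []
    for row_a in a:
        for row_b in b:
            rows.append([x * y for x in row_a for y in row_b])
    return rows


F2 = [[1, 1], [1, -1]]
table = kron(kron(F2, F2), F2)


def character_value(k, j):
    # Bits of k encode (b1, b2, b3); bits of j encode (z1, z2, z3).
    # exp(2*pi*i * b_i*N_i*z_i / (2*N_i)) = (-1)**(b_i*z_i)
    return (-1) ** bin(k & j).count("1")


expected = [[character_value(k, j) for k in range(8)] for j in range(8)]
matches = table == expected
```

Because all three factors equal $F(2)$, the result does not depend on which coordinate is assigned to which bit.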


By Eq. (46), we need to compute $f_{b^*}$ on a set of $B$-coset representatives, say,

$$C = \{(a_1, a_2, a_3) : 0 \le a_j \le N_j - 1,\ j = 1,2,3\}.$$

Order $C$ antilexicographically. Denote by $\mathbf{f}_0$ the vector of values of $f$ on $C$ listed in order by the ordering of $C$. Similarly, define the vectors $\mathbf{f}_k$, $0 \le k \le 7$, by listing the values in order of

$$\mathbf{f}_k = [f(c + b_k)], \quad c \in C.$$

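Then the periodization is obtained by the matrix operation,

$$\begin{bmatrix} \mathbf{f}_{b_0^*} \\ \mathbf{f}_{b_1^*} \\ \vdots \\ \mathbf{f}_{b_7^*} \end{bmatrix} = \left(F(2) \otimes I_{N_3} \otimes F(2) \otimes I_{N_2} \otimes F(2) \otimes I_{N_1}\right) \begin{bmatrix} \mathbf{f}_0 \\ \mathbf{f}_1 \\ \vdots \\ \mathbf{f}_7 \end{bmatrix},$$

where $I_K$ denotes the $K \times K$ identity matrix.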


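$$B^\perp = \{(2a_1, 2a_2, 2a_3) : 0 \le a_j \le N_j - 1,\ j = 1,2,3\}.$$

With the following choice of $B^\perp$-coset representatives,

$$z(b_0^*) = (0,0,0), \quad z(b_1^*) = (1,0,0), \quad z(b_2^*) = (0,1,0), \quad z(b_3^*) = (1,1,0),$$

$$z(b_4^*) = (0,0,1), \quad z(b_5^*) = (1,0,1), \quad z(b_6^*) = (0,1,1), \quad z(b_7^*) = (1,1,1),$$

$$\begin{bmatrix} \mathbf{g}_{b_0^*} \\ \mathbf{g}_{b_1^*} \\ \vdots \\ \mathbf{g}_{b_7^*} \end{bmatrix} = T \begin{bmatrix} \mathbf{f}_{b_0^*} \\ \mathbf{f}_{b_1^*} \\ \vdots \\ \mathbf{f}_{b_7^*} \end{bmatrix},$$

where $T$ is the $8N_1N_2N_3 \times 8N_1N_2N_3$ diagonal matrix whose entry at the position of $(a_1, a_2, a_3) \in C$ in the $k$-th block of length $N_1N_2N_3$ is

$$\langle (a_1, a_2, a_3), \phi(z(b_k^*)) \rangle, \quad 0 \le k \le 7.$$

Since

$$A/B \cong Z/N_1 \times Z/N_2 \times Z/N_3,$$

the induced FT is of size $N_1 \times N_2 \times N_3$ applied to the 8 independent functions $g_{b_k^*}$, $0 \le k \le 7$.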

E.6  Affine  Group  RT  Algorithms 

E.6.1  Introduction 

A class of affine group RT algorithms will be constructed which act on data $f \in L(A)$ invariant under the action of affine subgroups $X \le \mathrm{Aff}(A)$. The effect will be two-fold.

•  reduction in the number of required induced FT computations.

•  the induced FT computations will be on data invariant under a collection of subgroups of $X$.

For $x \in \mathrm{Aff}(A)$, we define two actions on $L(A)$.

$$xf(a) = f(xa),$$

$$x^\# f(a) = \langle a_x, \phi(\alpha_x^\# a) \rangle\, f(\alpha_x^\# a). \quad (57)$$

The first main result we have is

Theorem E.8

$$F_\phi(xf) = x^\# F_\phi(f).$$
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Proof 


F^{xf){c)  =  J^(a;/)(a)  <  a,<?^(c)  > 


a^A 


=  ^  f{oix(^  +  ^x)  ^  > 

a^A 

=  I]  /(«)  <  (“  “  '^(^)  > 

a  E  A 

=  <  a-^a^,  4>{c)  >  Yj  /(®)  <  M  > 

a^A 

=  x*F^{c). 


Corollary $f$ is $x$-invariant if and only if $F_\phi f$ is $x^\#$-invariant.

RT  algorithms  provide  a  general  framework  for  computing  the  FT  of  data  invariant  under 
affine  subgroups.  We  begin  with  data  invariant  under  point  groups. 
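For a pure point-group element (no translation part) the corollary can be checked numerically. The sketch below is only an illustration under stated assumptions: the lattice $Z/6 \times Z/6$, the six-fold generator `ALPHA`, and the dual action `ALPHA_SHARP` taken as the inverse transpose of `ALPHA` are choices made here, picked so that `ALPHA_SHARP` maps $(0,1)$ to $(-1,1)$ as in Tables E.6 and E.7.

```python
import cmath
import random

N = 6  # illustrative lattice Z/6 x Z/6

# Assumed generator of a six-fold point group and its dual action.
ALPHA = ((1, -1), (1, 0))        # alpha(a1, a2) = (a1 - a2, a1)
ALPHA_SHARP = ((0, -1), (1, 1))  # inverse transpose of ALPHA


def act(m, v):
    """Apply a 2x2 integer matrix to a point of Z/N x Z/N."""
    return ((m[0][0] * v[0] + m[0][1] * v[1]) % N,
            (m[1][0] * v[0] + m[1][1] * v[1]) % N)


points = [(a1, a2) for a1 in range(N) for a2 in range(N)]

# Build an alpha-invariant f by summing arbitrary data over alpha-orbits.
random.seed(0)
g = {a: random.random() for a in points}
f = {}
for a in points:
    total, b = 0.0, a
    for _ in range(6):  # alpha has order 6
        total += g[b]
        b = act(ALPHA, b)
    f[a] = total

w = cmath.exp(2j * cmath.pi / N)
F = {c: sum(f[a] * w ** ((a[0] * c[0] + a[1] * c[1]) % N) for a in points)
     for c in points}

f_is_invariant = all(abs(f[act(ALPHA, a)] - f[a]) < 1e-9 for a in points)
F_is_invariant = all(abs(F[act(ALPHA_SHARP, c)] - F[c]) < 1e-9 for c in points)
```

Since the FT takes the same value along each orbit of the dual action, only one value per orbit has to be computed, which is the redundancy the point group RT algorithm exploits.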

E.6.2  Point  group  RT  algorithm 

Choose a dual covering $\mathcal{B}$ of $A$. The RT algorithm computes $F_\phi f$, $f \in L(A)$, by the collection of induced FT computations

$$F_{\phi_1}(\mathrm{Per}_B f), \quad B \in \mathcal{B}.$$

We will now describe how to modify this form of the RT algorithm when $f$ is invariant under the action of a point group $H \le \mathrm{Aut}(A)$. This invariance will reduce the number of required induced FT computations to a set of induced FT computations on data invariant under subgroups of $H$.

Suppose $f$ is $H$-invariant. Choose a dual covering $\mathcal{B}$ invariant under $H$,

$$h(B) \in \mathcal{B}, \quad h \in H,\ B \in \mathcal{B}.$$

The collection of dual subgroups $\mathcal{B}^\perp$ is invariant under $H^\#$ and we can choose a subset $\mathcal{B}_0 \subset \mathcal{B}$ such that $\mathcal{B}_0^\perp$ is a complete system of $H^\#$-orbit representatives in $\mathcal{B}^\perp$. Since $f$ is $H$-invariant, $F_\phi f$ is $H^\#$-invariant and it suffices to compute the following collection of induced FT.

$$F_{\phi_1}(\mathrm{Per}_B f), \quad B \in \mathcal{B}_0. \quad (58)$$

This has the effect of reducing the number of induced FT required to complete the computation.

The periodized data $\mathrm{Per}_B f$, $B \in \mathcal{B}_0$, inherits some of the data redundancy of $f$. For a subgroup $B \le A$, define

$$H_B = \{h \in H : h(B) = B\}.$$

$H_B$ acts on $A/B$ by

$$h(a + B) = ha + B, \quad h \in H_B,\ a \in A.$$

Theorem E.9 If $f$ is $H$-invariant and $B$ is a subgroup of $A$, then

$$\mathrm{Per}_B f(ha) = \mathrm{Per}_{h^{-1}(B)} f(a), \quad a \in A,\ h \in H.$$

In particular, $\mathrm{Per}_B f \in L(A/B)$ is $H_B$-invariant.
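Theorem E.9 can be illustrated on a small lattice. The sketch below assumes $A = Z/4 \times Z/4$, a four-fold rotation `h` as the point group generator, and $B = \langle(1,0)\rangle$; these choices and the helper names are illustrative, not taken from the text. Integer data are used so the comparisons are exact.

```python
N = 4
points = [(x, y) for x in range(N) for y in range(N)]


def h(a):
    """Assumed point-group generator: a four-fold rotation of Z/4 x Z/4."""
    return ((-a[1]) % N, a[0] % N)


def h_inv(a):
    return (a[1] % N, (-a[0]) % N)


def g(a):
    """Arbitrary integer-valued data used to build an invariant function."""
    return (3 * a[0] + 5 * a[1] * a[1] + a[0] * a[1]) % 11


def f(a):
    """H-invariant function: sum of g over the orbit of a under <h>."""
    total = 0
    for _ in range(4):
        total += g(a)
        a = h(a)
    return total


B = [(t, 0) for t in range(N)]   # subgroup generated by (1, 0)
h_inv_B = [h_inv(b) for b in B]  # h^{-1}(B), the subgroup generated by (0, 1)


def per(subgroup, a):
    """Periodization of f over a subgroup, evaluated at a."""
    return sum(f(((a[0] + b[0]) % N, (a[1] + b[1]) % N)) for b in subgroup)


# Per_B f(h a) = Per_{h^{-1}(B)} f(a)
theorem_holds = all(per(B, h(a)) == per(h_inv_B, a) for a in points)

# h^2 = -1 stabilizes B, so Per_B f is invariant under it.
stabilizer_invariance = all(per(B, h(h(a))) == per(B, a) for a in points)
```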

By the theorem, the induced FT in Eq. (58) is computed on $H_B$-invariant $\mathrm{Per}_B f$, $B \in \mathcal{B}_0$. To make full use of the $H$-invariance of $f$ we must supply code which makes full use of this $H_B$-invariance. In crystallographic applications we can choose $\mathcal{B}$ such that $A/B$ is 1-D or 2-D. Standard point group FFT algorithms can be applied in the 1-D case (see appendix). 2-D point group invariant FFT algorithms have recently been implemented using variants of Winograd's multiplicative FFT [3,5].

$H$-invariant RT algorithm Choose a dual covering $\mathcal{B}$ of $A$ invariant under $H$ and a complete system of $H$-orbit representatives $\mathcal{B}_0$ in $\mathcal{B}$.

•  Form the periodizations

$$\mathrm{Per}_B f \in L(A/B), \quad B \in \mathcal{B}_0.$$

•  Compute the $H_B$-invariant induced FTs

$$F_{\phi_1}(\mathrm{Per}_B f), \quad B \in \mathcal{B}_0.$$

•  Compute

$$F_{\phi_1}(\mathrm{Per}_B f), \quad B \in \mathcal{B},$$

by $H^\#$-invariance.

Example E.12 P6-invariant RT algorithm I

Set

$$A_3 = Z/6M, \quad A = Z/3 \cdot 2^N \times Z/3 \cdot 2^N \times A_3,$$

for integers $N$ and $M$. Using the Chinese remainder theorem, we can write $A$ as

$$(e_1A_1 + e_2A_2) \times A_3,$$

where $A_1 \cong Z/2^N \times Z/2^N$ and $A_2 \cong Z/3 \times Z/3$. $A$ is covered by the following collection of subgroups, where $L_k$, $k = 0, 1, 2, 3$, are given in table E.2.

$$B_k = (e_1A_1 + e_2L_k^\perp) \times A_3,$$

$$B_k^\perp = (\{(0,0)\} + e_2L_k) \times \{0\}.$$

$$P6^\#(B_0) = \{B_0, B_1, B_3\}, \quad P6^\#(B_2) = \{B_2\},$$

and $\{B_k : 0 \le k \le 3\}$ is a $P6^\#$-invariant covering of $A$. Hence for $P6$-invariant $f \in L(A)$, we need to compute $F_\phi f$ only on $B_0$ and $B_2$.

$$f_0 = \mathrm{Per}_{B_0^\perp} f, \quad f_2 = \mathrm{Per}_{B_2^\perp} f.$$

To  index  the  periodization,  set 

Aj Br  A\  ^2Li  ^  r  =  0, 1, 

AjBs  :  Ai  +  ^2^4  )  5  =  2,3. 

For  0  <  ,  n2  <  -  1 ,  0  <  A;  <  2,  0  <  m  <  6M  -  1, 

2 

ein2,  /(eini  +  ^2^;,  ein2  +  62^,  m), 

a=0 

2 

/2(^i^i5  ^1^2  +  ^2^^  “  X  ^2(A:  +  2a),  6177-2  +  e2a,  m). 

a=0 


/o(Q:^(eini  +  62^;,  ein2,  m)) 


/i(-eini  -  e2k,-ein2,m), 

2 

y~)  /(— eini  —  62^;,  — ein2  ■+  620,  m), 

a=0 


2 

^  /(eini  +  62^:,  ein2  -  620,  m), 

a=0 

+  62^:,  ein2,  m), 


/2(Q;(eini,  61712  +  62^,  tti)) 


=  /3(-ei«2,  ei772  +  62^:  —  6x712,  m) 

2 

=  /(  —  Cl 772  +  262CI,  6i772  +  ^2^  —  ^1^2  +  62a,  TTl) 

a=:0 
2 

=  ^ /(eiTli  +  62^  +  2620,  61772  +  620,  m) 

a=0 

=  /3(eini,  6x772  +  62^1,777) 
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F6bo  =  F6b,  =  P6b,  =  {1, =  ^2,  F6b,  =  P6. 

The induced FT computations $F f_0$ and $F f_2$ are made on $P2$ and $P6$ invariant data, respectively.

Example E.13 P6-invariant RT algorithm II.

We can further reduce the invariance condition on the periodized functions by applying RT on $A_1$. To this end, we will set $A_1 = Z/4 \times Z/4$, and use the covering subgroups that are given in table E.4. The collection

$$\mathcal{B} = \{D_{j,k} = B_{j,k} \times A_3 : 0 \le j \le 5,\ 0 \le k \le 3\}$$

covers

$$Z/12 \times Z/12 \times A_3.$$

The dual subgroups are given by

$$D_{j,k}^\perp = B_{j,k}^\perp \times \{0\}, \quad 0 \le j \le 5,\ 0 \le k \le 3.$$

Let $\alpha^\# M_j \times A_3 = M_{j'} \times A_3$ and $\alpha^\# L_k \times A_3 = L_{k'} \times A_3$. Then we have

$$\alpha^\#((e_1M_j + e_2L_k) \times A_3) = (e_1M_{j'} + e_2L_{k'}) \times A_3.$$

Thus to compute the $P6^\#$-orbit decomposition of $\mathcal{B}$, we first decompose the collections $\{M_j \times A_3 : 0 \le j \le 5\}$ and $\{L_k \times A_3 : 0 \le k < 4\}$ independently, then place the decomposition into $\mathcal{B}$ by CRT.

Table E.6 $P6^\#$-orbit decomposition of subgroups in $Z/4 \times Z/4$

generator   (0,1)      α#(0,1) = (3,1)       α#(3,1) = (3,0)
subgroup    <(0,1)>    <(3,1)>               <(3,0)> = <(1,0)>
label       M_0        M_3                   M_4

generator   (1,1)      α#(1,1) = (3,2)       α#(3,2) = (2,1)
subgroup    <(1,1)>    <(3,2)> = <(1,2)>     <(2,1)>
label       M_1        M_5                   M_2

Table E.7 $P6^\#$-orbit decomposition of subgroups in $Z/3 \times Z/3$

generator   (0,1)      α#(0,1) = (2,1)       α#(2,1) = (2,0)
subgroup    <(0,1)>    <(2,1)>               <(2,0)> = <(1,0)>
label       L_0        L_2                   L_3

generator   (1,1)      α#(1,1) = (2,2)
subgroup    <(1,1)>    <(2,2)> = <(1,1)>
label       L_1

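We have the following $P6^\#$-orbit decomposition of $\mathcal{B}$.

$$P6^\#(D_{0,0}) = \{D_{0,0}, D_{3,2}, D_{4,3}\}, \quad P6^\#(D_{1,0}) = \{D_{1,0}, D_{5,2}, D_{2,3}\},$$

$$P6^\#(D_{4,0}) = \{D_{4,0}, D_{0,2}, D_{3,3}\}, \quad P6^\#(D_{2,0}) = \{D_{2,0}, D_{1,2}, D_{5,3}\},$$

$$P6^\#(D_{3,0}) = \{D_{3,0}, D_{4,2}, D_{0,3}\}, \quad P6^\#(D_{5,0}) = \{D_{5,0}, D_{2,2}, D_{1,3}\},$$

$$P6^\#(D_{0,1}) = \{D_{0,1}, D_{3,1}, D_{4,1}\}, \quad P6^\#(D_{1,1}) = \{D_{1,1}, D_{5,1}, D_{2,1}\}.$$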
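The orbit structure of the 24 covering subgroups of $Z/12 \times Z/12$ can be recomputed from Tables E.1, E.2 and E.6. The sketch below assumes the idempotents $e_1 = 9$, $e_2 = 4$ of $Z/12 \cong Z/4 \times Z/3$, the generator lists from Table E.1 (with $N = 2$) and Table E.2, and the dual action $\alpha^\#(x, y) = (-y, x + y)$, which reproduces $\alpha^\#(0,1) = (3,1)$ in Table E.6. Function names are hypothetical.

```python
E1, E2 = 9, 4  # idempotents of Z/12 = Z/4 x Z/3
M = [(0, 1), (1, 1), (2, 1), (3, 1), (1, 0), (1, 2)]  # Table E.1 with N = 2
L = [(0, 1), (1, 1), (2, 1), (1, 0)]                  # Table E.2


def alpha_sharp(v):
    """Assumed dual action: alpha#(x, y) = (-y, x + y) on Z/12 x Z/12."""
    x, y = v
    return ((-y) % 12, (x + y) % 12)


def subgroup(gen):
    """Cyclic subgroup of Z/12 x Z/12 generated by gen."""
    return frozenset(((t * gen[0]) % 12, (t * gen[1]) % 12) for t in range(12))


# Label each covering subgroup e1*M_j + e2*L_k by its index pair (j, k).
label = {}
for j, m in enumerate(M):
    for k, l in enumerate(L):
        gen = ((E1 * m[0] + E2 * l[0]) % 12, (E1 * m[1] + E2 * l[1]) % 12)
        label[subgroup(gen)] = (j, k)


def orbit(start):
    """Index pairs visited by repeatedly applying alpha# to a subgroup."""
    seen, s = [], start
    while label[s] not in seen:
        seen.append(label[s])
        s = frozenset(alpha_sharp(v) for v in s)
    return seen


orbits, covered = [], set()
for s, jk in sorted(label.items(), key=lambda item: item[1]):
    if jk not in covered:
        o = orbit(s)
        orbits.append(o)
        covered.update(o)
```

With these assumptions the 24 subgroups split into 8 orbits of 3, so one periodization and induced FT per orbit suffices.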


We  will  choose  as  PG'^-orbit  representatives, 


Bo  =  {^o'io,  AM- 


(59) 


It is easy to show that the periodizations of $P6$-invariant $f \in L(A)$ with respect to the duals of the above $P6^\#$-orbit representatives are $P2$-invariant, and the induced FT computations are made on this invariant data.

Let $\hat f$ be the FT of a $P6$-invariant function $f \in L(A)$. $\hat f$ on $D_{j,k} \in \mathcal{B}_0$ is determined by the induced FT of the $D_{j,k}^\perp$-periodized function $f_{D_{j,k}}$. By the $P6^\#$-invariance of $\hat f$, for example, $\hat f$ on $D_{0,0}$ determines $\hat f$ on $D_{3,2}$ and $D_{4,3}$.

$$\hat f(0, 1, m) = \hat f(11, 1, m) = \hat f(11, 0, m),$$

$$(0, 1, m) \in D_{0,0}, \quad (11, 1, m) \in D_{3,2}, \quad (11, 0, m) \in D_{4,3}.$$

Example E.14 P3-invariant RT algorithm

Crystallographic group $P3$ is generated by $\alpha^2$. Since $P3$ is a subgroup of $P6$, the $P6^\#$-invariant covering of

$$Z/12 \times Z/12 \times A_3$$

is also $P3^\#$-invariant. In fact, the $P3^\#$-orbits and the $P6^\#$-orbits of the covering subgroups are the same. Thus, as in the case of $P6$, the induced FTs are computed only on the collection $\mathcal{B}_0$. However, the periodized functions have only the trivial invariance, and symmetry-specific FT routines are not required.

Example E.15 P6/mmm-invariant covering for $Z/12 \times Z/12 \times A_3$.

The above two examples lead to the following unifying strategy.

Choose a point group $H$ that contains sufficiently many subgroups. Since an $H^\#$-invariant covering is invariant under any subgroup $K^\# \le H^\#$, for $K$-invariant data the RT algorithm proceeds by disabling the computations except on the $K^\#$-orbit representatives.

As an example, we will consider the crystallographic group $P6/mmm$, which contains all the trigonal and hexagonal point groups, which comprise 16 of the 53 3D crystallographic point groups.


PQ  f  mmm*  {Diq) 
P6  /  mmm'^  [D^q) 

PQ/Tnmm*{DQQ) 

PQ  f  rnmm^  {D2fl) 
PQ I  mmm*  {Dq^) 
PI>  I  mmm*  {Dff) 


{DU, 

Dt,2^ 

r)l  f)! 
^2,35  -^5,05 

n-L 

^2p, 

Dp), 

{DU, 

Dtp, 

^0,3^  ^4,0) 

n-i- 

^0,2, 

Dtp} , 

Dtp, 

{DU, 

Dtp, 

DU). 

{DU, 

Dtp, 

DU). 

{DU. 

Dtp, 

DU). 

A  collection  P6/mmm’^-orbit  representatives  is 


and the computation is required only on this collection of subgroups for $P6/mmm$-invariant functions. To simplify notation, set $H_{j,k} = P6/mmm_{D_{j,k}}$, the invariant group of the $D_{j,k}^\perp$-periodized functions.


Hop  =  H2,o  =  Hop  =  H^p  = 


The induced FT computations are made on the $H_{0,0}$- or $H_{1,0}$-invariant functions.
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^  =  Z/3-2^  X  Z/3-2^  X  Z/6M. 

By the fundamental theorem,

$$A \cong Z/2^N \times Z/2^N \times Z/3 \times Z/3 \times Z/3 \cdot 2M.$$

Let $e_1$ and $e_2$ be the system of idempotents associated with the isomorphism

$$Z/3 \cdot 2^N \cong Z/2^N \times Z/3$$

and again set $A_3 = Z/6M$.

$$\mathcal{D} = \{(e_1M_j + e_2L_k) \times A_3\},$$

where $M_j$ and $L_k$ are the collections of covering subgroups in $Z/2^N \times Z/2^N$ and $Z/3 \times Z/3$, respectively, as listed in tables E.1 and E.2. For easier reference, we repeat the tables here.


Table E.1 Covering subgroups of $Z/2^N \times Z/2^N$

index range         subgroup       generator   dual group generator
0 ≤ j < 2^N         M_j            (j,1)       (-1,j)
0 ≤ j < 2^{N-1}     M_{2^N+j}      (1,2j)      (-2j,1)

We will denote this collection by $\mathcal{B}$.


Table E.2 Covering subgroups of $Z/3 \times Z/3$

k   subgroup   generator   dual group generator
0   L_0        (0,1)       (1,0)
1   L_1        (1,1)       (2,1)
2   L_2        (2,1)       (1,1)
3   L_3        (1,0)       (0,1)

It is straightforward to show that $\mathcal{D}$ is a $P6/mmm^\#$-invariant dual covering of $A$. We will give the $P6/mmm^\#$-orbit decomposition of $\mathcal{D}$. Recall $\beta^\# = \beta$ and $\gamma^\# = \gamma$.

The $P6/mmm^\#$-orbit structure in $Z/3 \times Z/3$ is the same as that of $P6^\#$, since the actions of $\beta$ and $\gamma$ do not change the orbit structure.

$$P6/mmm^\#(L_0) = \{L_0, L_2, L_3\}, \quad P6/mmm^\#(L_1) = \{L_1\}.$$

$$\beta(L_0) = L_3, \quad \beta(L_1) = L_1, \quad \beta(L_2) = L_2.$$

The $P6^\#$-orbit of $\langle (j, 1) \rangle$,

$$P6^\# \langle (j, 1) \rangle = \{\langle (j, 1) \rangle, \langle (-1, j + 1) \rangle, \langle (-j - 1, j) \rangle\},$$

contains  three  distinct  subgroups.  To  see  this,  note  first 

<  (-i,i  +  i)  >  =  <  >> 

<  (-j-l,;)  >  =  <  >  . 

As  j  ranges  through  Uo,  j-\-J  -  1)  ranges  through  Z/2^  -  Uo,  and  -j  -  1  ranges  through  21, 
0  <l  <  -  1.  In  fact,  we  have  the  following  partitioning  of  B  into  P6*-orbits. 

u  {<  ih  1)  >,  <  {-I.i  +  !)>><  H  -  i>i)  >}• 

j^Uo 

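$\beta$ maps $\langle (j, 1) \rangle$ onto $\langle (1, j) \rangle$. We will show that there are exactly 4 subgroups of the form $\langle (j, 1) \rangle$ with $j \in U_0$ that are $\beta$-invariant. Suppose

$$\langle (j, 1) \rangle = \langle (1, j) \rangle = \langle (j^{-1}, 1) \rangle.$$

Then $j^2 \equiv 1 \bmod 2^N$. $j \in U_0$ can be written as $2l + 1$, $0 \le l \le 2^{N-1} - 1$. In terms of $l$, the following congruences hold.

$$(2l + 1)^2 = 4l^2 + 4l + 1 \equiv 1 \bmod 2^N.$$

$$4l(l + 1) \equiv 0 \bmod 2^N.$$

$$l(l + 1) \equiv 0 \bmod 2^{N-2}.$$

The last congruence has exactly 4 solutions for $0 \le l \le 2^{N-1} - 1$,

$$l = 0 \ \Rightarrow\ j = 1,$$

$$l = 2^{N-2} - 1 \ \Rightarrow\ j = 2^{N-1} - 1,$$

$$l = 2^{N-2} \ \Rightarrow\ j = 2^{N-1} + 1,$$

$$l = 2^{N-1} - 1 \ \Rightarrow\ j = 2^N - 1.$$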
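The count of $\beta$-invariant subgroups can be confirmed by brute force. This sketch enumerates the odd residues $j$ with $j^2 \equiv 1 \bmod 2^N$ and compares them with the four solutions listed above. It assumes $N \ge 3$, where the four values are distinct; the function name is hypothetical.

```python
def beta_invariant_units(n_exp):
    """Odd residues j modulo 2**n_exp with j*j = 1 (mod 2**n_exp)."""
    modulus = 2 ** n_exp
    return [j for j in range(1, modulus, 2) if (j * j) % modulus == 1]


results = {}
for n_exp in range(3, 11):
    m = 2 ** n_exp
    expected = [1, m // 2 - 1, m // 2 + 1, m - 1]
    results[n_exp] = beta_invariant_units(n_exp) == expected

all_match = all(results.values())
```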

Partitioning  of  B  into  P6/mmm* -orbits  is  given  below. 

•  1  <  i  <  2^-^  -  3, 

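$$\{\langle (2j+1, 1) \rangle,\ \langle (-1, 2j+2) \rangle,\ \langle (-2j-2, 2j+1) \rangle,$$

$$\langle (1, 2j+1) \rangle,\ \langle (2j+2, -1) \rangle,\ \langle (2j+1, -2j-2) \rangle\},$$

•  $$\{\langle (1, 1) \rangle,\ \langle (-1, 2) \rangle,\ \langle (-2, 1) \rangle\},$$

$$\{\langle (2^{N-1} - 1, 1) \rangle,\ \langle (-1, 2^{N-1}) \rangle,\ \langle (-2^{N-1}, 2^{N-1} - 1) \rangle\},$$

$$\{\langle (2^{N-1} + 1, 1) \rangle,\ \langle (-1, 2^{N-1} + 2) \rangle,\ \langle (-2^{N-1} - 2, 2^{N-1} + 1) \rangle\},$$

$$\{\langle (-1, 1) \rangle,\ \langle (-1, 0) \rangle,\ \langle (0, -1) \rangle\}.$$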

All but 4 of the $P6/mmm^\#$-orbits in $\mathcal{B}$ contain 6 subgroups; the remaining 4 contain 3 subgroups. Action by $\gamma$ does not change the orbit structure.

We  list  two  examples  of  P6/mmm’^-orbits  in  V. 

Set  /  =  2j  +  1.  From  the  orbit  of  <  (/,  1)  >  in  P  and  To,  we  obtain 

<  (ei/,  1)  >  X A3,  <  (-ei  +  262, 61/  +  62)  >  X A3, 

<  (-61/  +  62,  eiO  >  X  A3,  <  (1,  eiO  >  XA3, 

<  {eil  +  262,  -ei  +  62)  >  x^3,  <  (eiT -eil  +  62)  >  x  A3. 

from  the  orbit  of  <  (1, 1)  >  and  To,  we  obtain 

<(ei,l)>xA3,  <  (-61  +  262,  61  +  1)  >  XA3,  <  (-261  +  62,61)  >  X  A3. 

In  V,  there  are  4  -  •  •  2^"^  P6/mmm#-orbits,  4  of  which  contain  3  subgroups,  the  rest  contain 
6  subgroups. 

For completeness, we list the values of the idempotents.

(1) If $2^N \equiv 1 \bmod 3$, then

$$e_1 = 2^{N+1} + 1, \quad e_2 = 2^N.$$

(2) If $2^N \equiv 2 \bmod 3$, then

$$e_1 = 2^N + 1, \quad e_2 = 2^{N+1}.$$
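The two cases can be verified mechanically. The sketch below computes $e_1$, $e_2$ from the formulas above and checks the defining properties of a system of idempotents for $Z/3 \cdot 2^N \cong Z/2^N \times Z/3$. The function name `idempotents` is hypothetical.

```python
def idempotents(n_exp):
    """Return (e1, e2) for Z/(3 * 2**n_exp) following cases (1) and (2)."""
    p = 2 ** n_exp
    if p % 3 == 1:
        return 2 * p + 1, p
    return p + 1, 2 * p


checks = []
for n_exp in range(1, 13):
    p = 2 ** n_exp
    n = 3 * p
    e1, e2 = idempotents(n_exp)
    checks.append(
        e1 % p == 1 and e1 % 3 == 0       # e1 corresponds to (1, 0)
        and e2 % p == 0 and e2 % 3 == 1   # e2 corresponds to (0, 1)
        and (e1 * e1) % n == e1 and (e2 * e2) % n == e2
        and (e1 * e2) % n == 0 and (e1 + e2) % n == 1
    )

all_valid = all(checks)
```

For $N = 2$ this gives $e_1 = 9$, $e_2 = 4$, the values that reproduce the generators in Table E.4.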


Choose a $P6/mmm$-invariant function $f \in L(A)$. By the invariance, the induced FT computation only on a collection of $P6/mmm^\#$-orbit representatives determines the FT of $f$. As in example 6.4, the periodized functions are invariant under one of the two subgroups of $P6/mmm$, $H_{0,0}$ or $H_{1,0}$. Specifically, a periodized function $f_D$ is $H_{1,0}$-invariant if the $P6/mmm^\#$-orbit of $D$ contains 6 subgroups, while $f_D$ is $H_{0,0}$-invariant if the $P6/mmm^\#$-orbit of $D$ contains 3 subgroups.
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E.6.3  Affine  group  RT  algorithm 

Choose  a  subgroup  X  of  Aff{A)  and  denote  the  point  group  of  X  by  X-  For  X-invariant 
$f \in L(A)$ we have

$$F_\phi f(\alpha_x^\# a) = \langle a_x, \phi(\alpha_x^\# a) \rangle\, F_\phi f(a), \quad a \in A,\ x \in X. \quad (60)$$

$F_\phi f$ is not invariant under $X^\#$ but $F_\phi f(a)$ determines $F_\phi f$ at each point in the $X^\#$-orbit of $a$.

Choose an $X$-invariant dual covering $\mathcal{B}$ of $A$ and a complete system $\mathcal{B}_0$ of $X$-orbit representatives in $\mathcal{B}$. $\mathcal{B}_0^\perp$ is a complete system of $X^\#$-orbit representatives in the covering $\mathcal{B}^\perp$ of $A$. In the presence of $X$-invariance, the RT algorithm can be implemented by first computing the induced FT

$$F_{\phi_1}(\mathrm{Per}_B f), \quad B \in \mathcal{B}_0.$$

The remaining induced FT computations can be determined by complex multiplications implied by theorem E.8. The $X$-invariance of $f$ reduces the number of required induced FT computations.

For any subgroup $B \le A$, define

$$X_B = \{x \in X : \alpha_x(B) = B\}.$$

$X_B$ is a subgroup of $X$ and acts on $L(A/B)$.

Theorem E.10 If $f$ is $X$-invariant then $\mathrm{Per}_B f \in L(A/B)$ is $X_B$-invariant.

By the theorem the induced FT computations

$$F_{\phi_1}(\mathrm{Per}_B f), \quad B \in \mathcal{B}_0$$

are taken on $X_B$-invariant data. To make full use of the $X$-invariance of $f$ we must provide code which makes full use of the $X_B$-invariance of $\mathrm{Per}_B f$, $B \in \mathcal{B}_0$. In 1-D or 2-D, affine group invariant FFT algorithms are substantially simpler due to the restricted class of 1-D or 2-D affine group actions (see appendix).

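$X$-invariant RT algorithm Choose an $X$-invariant dual covering $\mathcal{B}$ of $A$ and a complete system $\mathcal{B}_0$ of $X$-orbit representatives in $\mathcal{B}$.

•  Form the periodizations

$$\mathrm{Per}_B f \in L(A/B), \quad B \in \mathcal{B}_0.$$

•  Compute the $X_B$-invariant FT

$$F_{\phi_1}(\mathrm{Per}_B f), \quad B \in \mathcal{B}_0.$$

•  Compute

$$F_{\phi_1}(\mathrm{Per}_B f), \quad B \in \mathcal{B},$$

by the complex multiplications implied by theorem E.8.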
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Example  E.17  Affine  group-invariant  RT 

There are 5 affine crystallographic groups whose point group is $P6$.

Table E.8 Affine groups with point group $P6$

group    generator
P6_1     (0,0,M,α)
P6_2     (0,0,2M,α)
P6_3     (0,0,3M,α)
P6_4     (0,0,4M,α)
P6_5     (0,0,5M,α)

The RT algorithm proceeds as in the case of $P6$. Now the invariance condition on the FT is given by Eq. (60). For $0 \le l \le 5$ and a $P6_l$-invariant $f \in L(A)$, the induced FT of the $D_{j,k}^\perp$-periodization of $f$ determines $\hat f$ on $D_{j,k} \in \mathcal{B}_0$. To determine $\hat f$ on the $P6^\#$-orbits of $D_{j,k}$, set

<  (ci,C2,C3),^(0,0,M)  >=w  =  exp^. 


/(ci,C2,C3)  =  w‘'^‘f{a*{ci,C2,C3)) 

=  u;^‘=^7((a^)#(ci,C2,C3)) 

=  w'^^^^f{{a‘^)*{ci,C2,C3)) 

=  W^^^'f{i<^^)*ic3,C2,C3)),  l</<5. 


1  <  /  <  5. 

The group that contains all of the 48 tetragonal crystallographic groups is $P4/mmm$. As in the case of $P6/mmm$, once a $P4/mmm^\#$-invariant covering is partitioned into $P4/mmm^\#$-orbits, a code for the RT algorithm with respect to this partitioning contains codes for FT computation of functions invariant under subgroups of $P4/mmm$.

One can also choose a group that contains all the crystallographic point groups; this group need not be a crystallographic group.

E.6.4 $X^\#$-invariant RT algorithm

Consider a subgroup $X$ of $\mathrm{Aff}(A)$. In many applications we will have to compute the inverse FT of $X^\#$-invariant data. Up to index reversal, this problem is equivalent to computing the FT of $X^\#$-invariant data. We will embed this problem in the second form RT algorithm. In problems requiring several stages of FT and inverse FT, it makes sense to follow the first form RT algorithm, which outputs decimated data, by the second form RT algorithm, which inputs decimated data, and conversely, removing the necessity of data rearrangement steps at each cycle.

In the second form of the RT algorithm we compute $F_\phi f$, $f \in L(A)$, by first computing the collection of induced FT

$$F_\phi(\mathrm{Dec}_B f), \quad B \in \mathcal{B}.$$

Theorem E.11 For a subgroup $B \le A$, if $f \in L(A)$ is $X^\#$-invariant, then

$$F_\phi(\mathrm{Dec}_B f)(-a) = F_\phi(\mathrm{Dec}_{\alpha_x^\# B} f)(-xa), \quad a \in A,\ x \in X. \quad (61)$$


Proof 

F^(Fecs/)(-c)  =  2/(6)<6,<^(c)> 

b^B 

beB 

=  /(^)  <  4>{0‘xC  -  Ca:)  > 

bec>*B 

=  F,^{Dec^#gf{-xc). 

Choose an $X^\#$-invariant covering $\mathcal{B}$ of $A$ and a complete system $\mathcal{B}_0$ of $X^\#$-orbit representatives in $\mathcal{B}$. It suffices to compute the collection of induced FT

$$F_\phi(\mathrm{Dec}_B f), \quad B \in \mathcal{B}_0.$$

The remaining induced FT computations can be computed from the theorem.

Set

$$X_B = \{x \in X : \alpha_x^\#(B) = B\}.$$

Theorem E.12 For $X^\#$-invariant $f \in L(A)$ and $B \le A$, $\mathrm{Dec}_B f$ is $X_B^\#$-invariant.

$$\mathrm{Dec}_B f(b) = \langle a_x, \phi(\alpha_x^\# b) \rangle\, \mathrm{Dec}_B f(\alpha_x^\# b), \quad b \in B,\ x \in X_B.$$

In 3D crystallographic applications, specialized routines as described in the preceding two subsections can be applied to these induced FT computations.
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E.7  Implementation  Results 

We  have  implemented  symmetrized  3D  crystallographic  FFTs  for  the  case  of  P6  symmetric 
data.  The  data  is  assumed  to  be  defined  on  the  Z/3N  x  Z/3N  x  Z/6M  lattice,  where  N  and 
M  are  powers  of  two. 


Algorithm  1 

1.  Use the CRT to re-index the data set such that the problem is transformed to an equivalent 5D computation:

Z/3N × Z/3N × Z/6M → Z/3 × Z/3 × Z/N × Z/N × Z/6M.

Although this step is computationally expensive, involving irregular accessing of the data stored in the main memory, in many applications where a large number of iterations of the forward and inverse FFT are required, the CRT re-indexing needs to be carried out only once and the optimization can then be performed in the 5D domain.

2.  Apply the RT algorithm to Z/3 × Z/3 to compute the periodized data on two out of the total four subgroups. The periodization results in two distinct data sets, A1 and A2, each defined on Z/3 × Z/N × Z/N × Z/6M.

3.  Perform two 4D FFTs on the data sets A1 and A2 to implement the induced FT. The sets A1 and A2 are P2 and P6 symmetric, respectively, so that efficient symmetrized FFT code can be used for the computations.

If symmetrized FFT code is not used in step 3, the computational cost is roughly 1/2 of that of the non-symmetrized FFT. In Figure E.1 we plot the speedup over the non-symmetrized FFT versus the size of the data set.
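The CRT re-indexing in step 1 can be illustrated in one dimension. The sketch below is a toy under stated assumptions, not the implemented routine: it uses $Z/12 \cong Z/4 \times Z/3$ as a stand-in for $Z/3N$, the idempotents $e_1 = 9$, $e_2 = 4$, and the kernel convention $e^{2\pi i ab/12}$. Indexing both input and output by $a = e_1a_1 + e_2a_2$ turns the 1-D FT into a 2-D FT with no twiddle factors, in the manner of the Good-Thomas prime factor algorithm.

```python
import cmath

N1, N2 = 4, 3        # Z/12 = Z/4 x Z/3, a small stand-in for Z/3N
N = N1 * N2
E1, E2 = 9, 4        # idempotents: E1 <-> (1, 0), E2 <-> (0, 1)
w = cmath.exp(2j * cmath.pi / N)  # assumed kernel convention

f = [complex(a % 5, (3 * a) % 7) for a in range(N)]

# Direct 1-D FT on Z/12.
direct = [sum(f[a] * w ** (a * c) for a in range(N)) for c in range(N)]

# CRT re-indexing: a = E1*a1 + E2*a2 turns the 1-D array into a 4 x 3 array.
f2 = [[f[(E1 * a1 + E2 * a2) % N] for a2 in range(N2)] for a1 in range(N1)]

# 2-D FT with kernels w**E1 (a 4th root of unity) and w**E2 (a cube root).
w1, w2 = w ** E1, w ** E2
F2 = [[sum(f2[a1][a2] * w1 ** (a1 * c1) * w2 ** (a2 * c2)
           for a1 in range(N1) for a2 in range(N2))
       for c2 in range(N2)]
      for c1 in range(N1)]

# The 2-D result equals the 1-D FT at the CRT-indexed output positions.
max_err = max(abs(F2[c1][c2] - direct[(E1 * c1 + E2 * c2) % N])
              for c1 in range(N1) for c2 in range(N2))
```

The index map is a fixed permutation, which is why it can be applied once and then reused across many forward and inverse transforms.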
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Figure  E.l  Speedup  of  the  P6  symmetrized  FFT  over  the  non- 
symmetrized  FFT  versus  the  data  size.  Symmetrized  RTA  on  "LIZ  x  Z/3. 


The second implementation results in even greater speedups over the non-symmetrized FFT:

Algorithm 2

1.  Use the CRT to re-index the data set such that the problem is transformed to an equivalent 5D computation:

Z/3N × Z/3N × Z/6M → Z/3 × Z/3 × Z/N × Z/N × Z/6M.

2.  Apply the RT algorithm on Z/3 × Z/3 × Z/N × Z/N and compute the periodized data on one third of the total 4 × (3/2)N subgroups. The periodization results in 2N distinct data sets, each defined on Z/6M.

3.  Perform 2N independent 1D FFTs on data of length 6M. These distinct data sets are P2 symmetric, so that efficient P2-symmetrized FFT code can be used.

If symmetrized FFT code is not used in step 3, the computational cost is roughly 1/3 of that of the non-symmetrized FFT. In Figure E.2 we plot the speedup over the non-symmetrized FFT versus the size of the data set. If P2 symmetrized FFT code is used, the computational cost is roughly 1/6 of that of the non-symmetrized FFT, which is the theoretical optimum since the original data are P6 symmetric.

Figure E.2 Speedup of the P6 symmetrized FFT over the non-symmetrized FFT versus the data size.


The P6 symmetrized RT algorithm based FFTs share the highly parallelizable structure of the general RT algorithm. A variety of choices of a multiprocessor algorithm are available, allowing for efficient implementations depending on the characteristics of the particular platform. Consider for example Algorithm 1. If two processors are available and all of the 2 · 3 · N · N · 6M data set is stored in each processor, no interprocessor communication is needed since each processor can independently compute the periodization and 4D FFT. If only half of the data is stored in the memory of each processor, then in order to compute the periodizations, each processor has to send its data to the other, resulting in a total amount of communication (number of processors × size of messages) equal to 2 · 3 · N · N · 6M.

If P > 2 processors are available, the data can be divided along the last dimension into sets of size 2 · 3 · N · N · 6M/P, each set being stored in the local memory of one processor. After the computation of the periodizations, each processor keeps 3 · N · N · 6M/P of local data, and then performs local FFTs along the first three dimensions. To complete the computation, FFTs along the last dimension have to be performed. Since the data are distributed among the processors along the last dimension, a global transposition is required: each processor keeps 1/P of its local data, and sends (P - 1)/P of its data to other processors. The total communication requirements are then (P - 1) × local data size = (P - 1) × 3 · N · N · 6M/P. In an alternative implementation, the P processors are divided into P/2 clusters of two processors, with local data being duplicated within each cluster. In this implementation, each node stores twice as much data as before, but the efficiency can be increased in certain multiprocessor networks since now the global transposition step is replaced with two independent global transpositions, each involving only P/2 nodes.

E.7.1  Complexity 

E.7.2  Row-Column  Algorithm 

Set 

A  =  Z/3N  X  Z/3N  X  Z/3M. 

On many parallel systems, computing the 3D FT with the conventional row-column algorithm, which processes the data one dimension at a time, costs considerably more in interprocessor communication than in FT computation. The RT algorithm offers alternative data movements in MD FT computation. We list some performance results here.

GT/RT algorithm I

Using CRT,

$$A \cong A_1 \times A_2 = (Z/3 \times Z/3) \times (Z/3 \times Z/N \times Z/N \times Z/M).$$

The data reduction (periodization) stage costs $4 \times 2 \times 3N^2M$ additions, which can be combined with the data loading operation in a broadcasting mode; on some parallel systems it is given for free. In a 4 processor system, each processor carries out $2 \times 3N^2M$ additions while receiving input data, followed by a local 5D $3 \times 3 \times N \times N \times M$ FT computation. This algorithm eliminates interprocessor communication completely, and each processor has a balanced load with a uniform computation format.

E.7.3 GT/RT algorithm II

$$A \cong A_1 \times A_2 = (Z/3 \times Z/3 \times Z/3) \times (Z/N \times Z/N \times Z/M).$$

In this decomposition, each processor carries out $(2 \times 3) \times N^2M$ additions to implement periodization while receiving input data, followed by a local 4D $3 \times N \times N \times M$ FT computation. This decomposition is well suited to a 13 processor system. Both reduction and FT computation are carried out in parallel.

The RT algorithms I and II show uniform decomposition of a 3D problem into subsets. The combination of RT algorithms with other fast algorithms will provide a highly scalable feature that can be matched to various degrees of parallelism and granularity of a parallel system. The RT algorithm partitions input data at the global level to match each subset to the node processors, carrying out loading and reduction operations concurrently at each node; then the FT computations are performed in parallel.

In tables E.9 and E.10, timing results on the Intel iPSC/860 with 4 and 8 node implementations are given. The timing results of the next power of 2 sizes of the Intel FFT library are also included for comparison. (Non-power of 2 routines are not available in the standard library.) The GT/RT algorithm I was implemented on the 4-node hypercube architecture.

The periodization (reduction stage) is coded in standard Fortran, whereas the FFT and the 3-point FT call the Kuck & Associates optimized assembly routines and our own vectorized 3-point FT routines, respectively.


Table E.9 Timing Results on iPSC/860 (3-D) (4-nodes)

GT/RT (4-nodes)              Row-Column (4-nodes)
size            time         size              time
48 × 48 × 48    360 ms       64 × 64 × 64      566 ms
48 × 48 × 96    572 ms       64 × 64 × 128     1122 ms
48 × 96 × 96    980 ms       64 × 128 × 128    2202 ms

Table E.10 Timing Results on iPSC/860 (3-D) (4-nodes)

GT/RT (4-nodes)              Row-Column (8-nodes)
size            time         size               time
48 × 48 × 48    360 ms       64 × 64 × 64       282 ms
48 × 48 × 96    572 ms       64 × 64 × 128      585 ms
48 × 96 × 96    980 ms       64 × 128 × 128     1152 ms
96 × 96 × 96    2029 ms      128 × 128 × 128    2276 ms

E.8  Affine  group  CT  FFT 

The global decomposition stage of a CT FFT algorithm computes pseudo-periodizations relative to a subgroup $B$ of the indexing group $A$. In this chapter we present a CT FFT algorithm whose pseudo-periodizations are taken relative to an abelian subgroup $X \le \mathrm{Aff}(A)$. In the classical case, $X$ consists of pure translations. If $Y$ is a subgroup of $X$, the CT FFT algorithm associated to $X$ can easily be adapted to produce an FFT algorithm for $Y$-invariant data. Code which implements this CT FFT produces, by a process of disabling, $Y$-invariant FFT code for every subgroup $Y$ of $X$.

For applications, the choice of $X$ is motivated by two factors. First, code for the CT FFT associated to $X$ should be simple to write, scalable and efficient. Second, $X$ should contain a large collection of subgroups of interest in applications.


E.8.1  Extended  CT  FFT:  abelian  point  group 

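Choose $f \in L(A)$ and an abelian subgroup $G$ of $\mathrm{Aut}(A)$. For $\gamma^* \in G^*$ define the
pseudo-periodizations $f_{\gamma^*} \in L(A)$ by

$$f_{\gamma^*}(a) = \sum_{\gamma \in G} f(\gamma a)\,\langle \gamma, \gamma^* \rangle, \qquad a \in A. \tag{62}$$

Since

$$\sum_{\gamma^* \in G^*} \langle \gamma, \gamma^* \rangle =
\begin{cases} o(G), & \gamma = \text{identity map}, \\ 0, & \text{otherwise}, \end{cases} \tag{63}$$

we can write

$$f = \frac{1}{o(G)} \sum_{\gamma^* \in G^*} f_{\gamma^*}. \tag{64}$$

We can compute $F_\phi f$ by computing the collection of FT's

$$F_\phi f_{\gamma^*}, \qquad \gamma^* \in G^*. \tag{65}$$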

We  have  replaced  a  single  FT  computation  by  a  collection  of  FT  computations.  However, 
the  pseudo-periodizations  satisfy  the  following  group  invariance  property. 

Theorem E.13  For $\gamma^* \in G^*$,

$$f_{\gamma^*}(\gamma a) = \langle \gamma, \gamma^* \rangle\, f_{\gamma^*}(a), \qquad a \in A,\ \gamma \in G.$$

=<  7)7*  >  ae  A,^  £  G. 


We will say that $f_{\gamma^*}$ is $G$-invariant with character $\gamma^*$. The CT FFT associated to $G$ decomposes
the computation of $F_\phi f$ into a collection of FT computations on $G$-invariant with character
data, which can be implemented by simple modifications of the point group RT algorithm.

Suppose $K$ is a subgroup of $G$. If we begin with $K$-invariant data, we can reduce the
number of FT computations. Set

$$K_* = \{\gamma^* \in G^* : \langle \kappa, \gamma^* \rangle = 1 \text{ for all } \kappa \in K\}. \tag{66}$$

$K_*$ is a subgroup of $G^*$ isomorphic to the character group $(G/K)^*$. Choose a complete set of
representatives of $K$-cosets in $G$,

$$\gamma_0, \gamma_1, \ldots, \gamma_{L-1}. \tag{67}$$

Then every $\gamma \in G$ can be written uniquely in the form

$$\gamma = \gamma_l \kappa, \qquad \kappa \in K,\ 0 \le l < L. \tag{68}$$

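Theorem E.14  If $f \in L(A)$ is $K$-invariant then the pseudo-periodization $f_{\gamma^*}$ vanishes unless
$\gamma^* \in K_*$.

Proof

$$\begin{aligned}
f_{\gamma^*}(a) &= \sum_{l=0}^{L-1} \sum_{\kappa \in K} f(\gamma_l \kappa a)\,\langle \gamma_l \kappa, \gamma^* \rangle \\
&= \sum_{l=0}^{L-1} f(\gamma_l a)\,\langle \gamma_l, \gamma^* \rangle \sum_{\kappa \in K} \langle \kappa, \gamma^* \rangle
\end{aligned}$$

by $K$-invariance. Since $\sum_{\kappa \in K} \langle \kappa, \gamma^* \rangle$ vanishes unless $\gamma^* \in K_*$, the proof of the theorem is
complete.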

Code for the CT FFT algorithm associated to $G$ applies to the computation of the FT of
$K$-invariant data, $K < G$, by disabling all the pseudo-periodizations corresponding to $\gamma^* \notin K_*$.
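The construction of section E.8.1 can be checked numerically on a small case. The sketch below is an illustration only, not the implementation used in this work: it assumes $A = \mathbb{Z}/4 \times \mathbb{Z}/6$, takes $G$ to be the group of coordinate sign changes (so every character takes the values $\pm 1$), and uses arbitrary integer data. The names `act`, `pairing` and `pseudo_periodization` are hypothetical helpers introduced here. It verifies Eq. (64), the invariance of Theorem E.13, and the vanishing statement of Theorem E.14 for $K = \{1, a \mapsto -a\}$.

```python
from itertools import product

N1, N2 = 4, 6                                  # A = Z/4 x Z/6 (illustrative sizes)
A = list(product(range(N1), range(N2)))
G = list(product((0, 1), repeat=2))            # (s1, s2): s_i = 1 flips the sign of a_i
G_DUAL = list(product((0, 1), repeat=2))       # characters of G, indexed the same way


def act(g, a):
    """Apply the point-group element g to the index a."""
    return tuple((-x if s else x) % n for s, x, n in zip(g, a, (N1, N2)))


def pairing(g, c):
    """Character value <g, c*> = (-1)^(g . c)."""
    return (-1) ** sum(s * t for s, t in zip(g, c))


def pseudo_periodization(f, c):
    """Eq. (62): f_c(a) = sum over g in G of f(g a) <g, c*>."""
    return {a: sum(f[act(g, a)] * pairing(g, c) for g in G) for a in A}


# Arbitrary integer test data, so every check below is exact.
f = {a: (3 * a[0] + 5 * a[1] * a[1] + a[0] * a[1]) % 7 for a in A}
P = {c: pseudo_periodization(f, c) for c in G_DUAL}

# Eq. (64): f = (1/o(G)) * (sum of all pseudo-periodizations).
reconstructs = all(sum(P[c][a] for c in G_DUAL) == len(G) * f[a] for a in A)

# Theorem E.13: each f_c is G-invariant with character c.
invariant = all(P[c][act(g, a)] == pairing(g, c) * P[c][a]
                for c in G_DUAL for g in G for a in A)

# Theorem E.14 with K = {identity, a -> -a}: only characters in K_* survive.
K = [(0, 0), (1, 1)]
K_STAR = [c for c in G_DUAL if all(pairing(k, c) == 1 for k in K)]
f_K = {a: f[a] + f[act((1, 1), a)] for a in A}           # symmetrized, hence K-invariant
vanishing = all(v == 0 for c in G_DUAL if c not in K_STAR
                for v in pseudo_periodization(f_K, c).values())
```

In this toy case half of the four pseudo-periodizations can be disabled for $K$-invariant input, which is the saving described above.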

E.8.2  CT  FFT  with  respect  to  Pmmm 
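For $\rho, \tau \in Pmmm$,

$$\rho = \rho_1^{j_1}\rho_2^{j_2}\rho_3^{j_3}, \qquad \tau = \rho_1^{k_1}\rho_2^{k_2}\rho_3^{k_3},$$

define

$$\langle \rho, \tau^* \rangle = (-1)^{j_1 k_1 + j_2 k_2 + j_3 k_3}.$$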

Associate with the function $f \in L(A)$ the column vector $\mathbf{f}_{s_0}$ of length $K = 8NML$ by listing
$f(a_1, a_2, a_3)$ in antilexicographic ordering of $(a_1, a_2, a_3) \in A$. Also define the vectors $\mathbf{f}_{s_{8j}}$, $0 \le j \le 7$,
by listing $f(s_{8j}(a_1, a_2, a_3))$ in the same order of $(a_1, a_2, a_3) \in A$. Then the generalized periodizations of $f$
with respect to $Pmmm$ can be implemented by the vector additions

$$\begin{bmatrix}
\mathbf{f}_{s_0^*} \\ \mathbf{f}_{s_8^*} \\ \mathbf{f}_{s_{16}^*} \\ \mathbf{f}_{s_{24}^*} \\
\mathbf{f}_{s_{32}^*} \\ \mathbf{f}_{s_{40}^*} \\ \mathbf{f}_{s_{48}^*} \\ \mathbf{f}_{s_{56}^*}
\end{bmatrix}
= \left[ F(2) \otimes F(2) \otimes F(2) \otimes I_K \right]
\begin{bmatrix}
\mathbf{f}_{s_0} \\ \mathbf{f}_{s_8} \\ \mathbf{f}_{s_{16}} \\ \mathbf{f}_{s_{24}} \\
\mathbf{f}_{s_{32}} \\ \mathbf{f}_{s_{40}} \\ \mathbf{f}_{s_{48}} \\ \mathbf{f}_{s_{56}}
\end{bmatrix} \tag{69}$$

where $F(2)$ denotes the 2-point FT matrix,

$$F(2) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},$$

and $I_K$ is the $K \times K$ identity matrix.

Crystallographic group $P2$ [13] is a subgroup of $Pmmm$,

$$P2 = \{1, s_{24}\},$$

$$P2_* = \{1, s_{24}, s_{32}, s_{56}\}.$$

If $f \in L(A)$ is $P2$-invariant, then 4 of the periodizations vanish. Each of the non-vanishing
periodizations is $Pmmm$-invariant up to multiplication by $\pm 1$, and the FT is computed with this
invariance.

Another crystallographic subgroup of $Pmmm$ is $P222$,

$$P222 = \{1, s_{24}, s_{40}, s_{48}\},$$

$$P222_* = \{1, s_{56}\}.$$

For $P222$-invariant $f$, all the periodizations except $f_{s_0^*}$ and $f_{s_{56}^*}$ vanish.

If $f$ is $Pmmm$-invariant, then computation is carried out only for $f_{s_0^*}$.

E.8.3  Extended CT FFT: abelian affine group

The discussion of section E.8.1 will be extended to abelian subgroups $X$ of $\mathrm{Aff}(A)$ of the
form $X = B \times K$, where $B$ is a subgroup of $A$ and $K$ is a subgroup of $\mathrm{Aut}(A)$. The CT FFT
algorithm associated to $X$ combines features of the standard CT FFT associated to $B$ and the
abelian point group CT FFT associated to $K$. The pseudo-periodizations are now taken with
respect to the affine subgroup $X$. The motivation is to unify the writing of FT code for affine
group invariant data.

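Choose an abelian subgroup $X$ of $\mathrm{Aff}(A)$ of the form $X = B \times K$. Then $X^* = B^* \times K^*$.
We will usually write $bk$ for $(b,k)$ and $b^*k^*$ for $(b^*,k^*)$. Denote a complete set of $B^\perp$-coset
representatives by

$$z(b^*), \qquad b^* \in B^*. \tag{70}$$

For $f \in L(A)$, define the pseudo-periodizations $f_{x^*} \in L(A)$, $x^* \in X^*$, by

$$f_{x^*}(a) = \sum_{x \in X} f(x(a))\,\langle x, x^* \rangle, \qquad a \in A,\ x^* \in X^*. \tag{71}$$

They satisfy

$$f_{x^*}(x(a)) = \langle x, x^* \rangle\, f_{x^*}(a), \qquad a \in A,\ x \in X,\ x^* \in X^*. \tag{72}$$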


Since

$$f = \frac{1}{o(X)} \sum_{x^* \in X^*} f_{x^*},$$

we can compute $F_\phi f$ by the collection of FT computations

$$F_\phi f_{x^*}, \qquad x^* \in X^*. \tag{73}$$

A direct computation shows that $f_{x^*}$ satisfies the group invariance with character condition. In
particular

$$f_{x^*}(b + a) = \langle b, b^* \rangle\, f_{x^*}(a), \qquad b \in B,\ x^* = b^*k^* \in X^*. \tag{74}$$

Define $g_{x^*} \in L(A)$, $x^* \in X^*$, by

$$g_{x^*}(a) = f_{x^*}(a)\,\langle a, \phi(z(b^*)) \rangle, \qquad a \in A,\ x^* = b^*k^*. \tag{75}$$

$g_{x^*}$ is $B$-invariant and can be viewed as a function in $L(A/B)$.

Theorem E.15  For $x^* = b^*k^* \in X^*$, $F_\phi f_{x^*}$ vanishes off of $z(b^*) + B^\perp$ and we have

$$F_\phi f_{x^*}(z(b^*) + b^\perp) = o(B)\, F_\phi g_{x^*}(b^\perp), \qquad b^\perp \in B^\perp.$$


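Proof  Choose a complete system of representatives for the $B$-cosets in $A$,

$$m_1, m_2, \ldots, m_J.$$

Setting

$$c = z(b_1^*) + b^\perp,\quad b_1^* \in B^*,\ b^\perp \in B^\perp, \qquad a = m_j + b,\quad 1 \le j \le J,\ b \in B,$$

in

$$F_\phi f_{x^*}(c) = \sum_{a \in A} f_{x^*}(a)\,\langle a, \phi(c) \rangle$$

we have, applying Eq. (74),

$$F_\phi f_{x^*}(c) = \sum_{j=1}^{J} f_{x^*}(m_j)\,\langle m_j, \phi(z(b_1^*) + b^\perp) \rangle \sum_{b \in B} \langle b, b_1^* - b^* \rangle,$$

which vanishes unless $b_1^* = b^*$, proving $F_\phi f_{x^*}$ vanishes off of $z(b^*) + B^\perp$. Then by theorem E.6,

$$F_\phi f_{x^*}(z(b^*) + b^\perp) = o(B) \sum_{j=1}^{J} g_{x^*}(m_j)\,\langle m_j, \phi(b^\perp) \rangle,$$

completing the proof of the theorem.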

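For $b^* \in B^*$ define

$$S(b^*) = \{ g_{b^*k^*} : k^* \in K^* \}. \tag{76}$$

By theorem E.15,

$$F_\phi f(z(b^*) + b^\perp) = \frac{1}{o(K)} \sum_{k^* \in K^*} F_\phi g_{b^*k^*}(b^\perp), \tag{77}$$

which implies that $F_\phi f$ on the coset

$$z(b^*) + B^\perp, \qquad b^* \in B^*,$$

is determined by the induced FT of functions in $S(b^*)$.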

The pseudo-periodization operations introduce data redundancies which we will now describe.

Set $C = A/B$. $K$ acts by the identity mapping on $B$ and induces a group of automorphisms
of $C$ denoted also by $K$. For $b^* \in B^*$ and $k \in K$, there exists a unique $\zeta_{b^*}(k) \in B^\perp$ such that

$$k^*(z(b^*)) = z(b^*) + \zeta_{b^*}(k). \tag{78}$$

Direct computation shows that

$$k^*(\zeta_{b^*}(k')) + \zeta_{b^*}(k) = \zeta_{b^*}(kk'), \qquad k, k' \in K. \tag{79}$$


Define 


Q-{k)  =  <h{(h’{k))  €  C‘. 


(80)

Theorem E.16  For $x^* = b^*k^* \in X^*$ and $k \in K$,

$$g_{x^*}(k(c)) = \langle k, k^* \rangle\,\langle c, \phi(\zeta_{b^*}(k)) \rangle\, g_{x^*}(c), \qquad c \in C. \tag{81}$$

=<«,«:*>  eB^^kE  K.  (82) 

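Proof  By Eqs. (72), (75) and (78), for $a \in A$ and $k \in K$,

$$\begin{aligned}
g_{x^*}(k(a)) &= \langle k, k^* \rangle\,\langle k(a), \phi(z(b^*)) \rangle\, f_{x^*}(a) \\
&= \langle k, k^* \rangle\,\langle a, \phi(k^*(z(b^*))) \rangle\, f_{x^*}(a) \\
&= \langle k, k^* \rangle\,\langle a, \phi(\zeta_{b^*}(k)) \rangle\, g_{x^*}(a).
\end{aligned}$$

The second statement can be proved by the usual arguments. A modified RT algorithm can be
applied to the induced FT computations.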

For a subgroup $Y$ of $X$, set

$$Y_* = \{ x^* \in X^* : \langle y, x^* \rangle = 1 \text{ for all } y \in Y \}. \tag{83}$$

Arguing as in theorem E.14, we have the following theorem.

Theorem E.17  If $X$ is a subgroup of $\mathrm{Aff}(A)$ and $Y$ is a subgroup of $X$, then for $Y$-invariant
$f \in L(A)$, the pseudo-periodizations $f_{x^*}$, $x^* \in X^*$, vanish unless $x^* \in Y_*$.

Affine group CT FFT code for $X$ can be used to compute the FT of $Y$-invariant data, for
any subgroup $Y$ of $X$. In several important applications, the group $X$ can be chosen such that
the corresponding CT FFT algorithm can be implemented by simple 1-D routines, while more
complicated code is required for a direct implementation of the FT of $Y$-invariant data.
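A one-dimensional toy case may help make the affine construction concrete. The sketch below is an assumption-laden illustration, not the code described in this report: it takes $A = \mathbb{Z}/12$, $B = \{0, N/2\}$ and $K = \{\pm 1\}$, identifies frequencies with $\mathbb{Z}/N$ so that $B^\perp$ is the set of even frequencies with $z(b^*) \in \{0, 1\}$, and uses a plain $O(N^2)$ DFT in place of an FFT routine. All function names are hypothetical. It checks the reconstruction formula, the support statement of Theorem E.15, and the vanishing statement of Theorem E.17 for $Y = \{1, a \mapsto N/2 - a\}$.

```python
import cmath
from itertools import product

N = 12                      # A = Z/12 (illustrative size; any even N works here)
HALF = N // 2               # B = {0, N/2}
X = list(product((0, 1), repeat=2))        # x = (t, s): a -> (-1)^s * a + t * N/2
X_DUAL = list(product((0, 1), repeat=2))   # x* = (b*, k*), indexed by bits as well


def act(x, a):
    """Apply the affine map x = (t, s) to the index a."""
    t, s = x
    return ((-a if s else a) + t * HALF) % N


def pairing(x, xs):
    """Character value <x, x*> = (-1)^(t b* + s k*)."""
    return (-1) ** (x[0] * xs[0] + x[1] * xs[1])


def pseudo_periodization(f, xs):
    """Eq. (71): f_{x*}(a) = sum over x in X of f(x(a)) <x, x*>."""
    return [sum(f[act(x, a)] * pairing(x, xs) for x in X) for a in range(N)]


def dft(v):
    """Plain O(n^2) DFT standing in for an FFT routine."""
    n = len(v)
    return [sum(v[a] * cmath.exp(-2j * cmath.pi * a * c / n) for a in range(n))
            for c in range(n)]


f = [(a * a * a + 2 * a + 1) % 11 for a in range(N)]     # arbitrary integer data
P = {xs: pseudo_periodization(f, xs) for xs in X_DUAL}

# f = (1/o(X)) * (sum of all pseudo-periodizations).
reconstructs = all(sum(P[xs][a] for xs in X_DUAL) == len(X) * f[a] for a in range(N))

# Theorem E.15: the FT of f_{x*} is supported on frequencies c with c = b* (mod 2).
support_ok = all(abs(Fc) < 1e-9
                 for xs in X_DUAL
                 for c, Fc in enumerate(dft(P[xs])) if c % 2 != xs[0])

# Theorem E.17 with Y = {identity, a -> N/2 - a}.
Y = [(0, 0), (1, 1)]
Y_STAR = [xs for xs in X_DUAL if all(pairing(y, xs) == 1 for y in Y)]
f_Y = [f[a] + f[act((1, 1), a)] for a in range(N)]       # Y-invariant by construction
vanishing = all(v == 0 for xs in X_DUAL if xs not in Y_STAR
                for v in pseudo_periodization(f_Y, xs))
```

Because every character here is real, the check does not depend on the conjugation convention chosen for the pairing.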

E.8.4  CT  FFT  with  respect  to  Fmmm 

We will continue with the notation established in example E.4.

$$Fmmm = B \times Pmmm.$$

We will use the $B$-periodization computation of example E.11 as the first stage of the two-stage
pseudo-periodizations with respect to $Fmmm$. Recall the ordering of the elements of $Fmmm$
given in example E.4:

$$B = \{s_0, s_1, s_2, s_3, s_4, s_5, s_6, s_7\},$$

$$Pmmm = \{s_0, s_8, s_{16}, s_{24}, s_{32}, s_{40}, s_{48}, s_{56}\},$$

$$Fmmm = \{s_{8l+k} : 0 \le k, l \le 7\}.$$

For $(a_1, a_2, a_3) \in A$, observe that

$$s_{8l+k}(a_1, a_2, a_3) = s_{8l}(a_1, a_2, a_3) + s_k, \qquad s_k \in B.$$

In  example  E.ll,  periodizations 

A-,  0</<7 

are made on the collection of $B$-coset representatives

$$C = \{(a_1, a_2, a_3) : 0 \le a_i < N_i,\ i = 1, 2, 3\}.$$

7  7 

^  ^2  /("^SnO  + -Sm)  <  -^m,  •Sfc  ><  "SSn, -Sg;  > 
n=0  771=0 
7 

=  m  fblfi^Sna)  <  SSn,  > 

n=0 

7 

=  /^^/(•SSn-t-nO)  <  -Sn,  ><  -Ssn,  -Ss;  > 

77=0 

CT  FFT  with  respect  to  Fmmm  was  implemented  on  a  Sun4  station  [1]. 

E.9  Incorporating 1D symmetries in FFT

We have developed various FFT algorithms incorporating certain 1D symmetries. In this section,
we give an example of incorporating invariance conditions in data without giving up the use of
highly efficient FFT routines.

Set $A = \mathbb{Z}/N$, for a natural number $N$. For $f \in L(A)$, the invariance conditions we will
consider here are

$$f(a) = \pm f(-a). \tag{84}$$


An efficient algorithm was given by Cooley et al. [10] and Rabiner [16] which reduced
the computation to that for an $N/2$-point FFT with preprocessing and postprocessing. The
procedures are summarized as follows.

a. Compute

$$V(0) = 2 \sum_{a=0}^{N/4-1} f(2a+1).$$

b. For $a = 1, 2, \ldots, N/4 - 1$, formulate the sequence $g(a)$ as

$$\begin{aligned}
g(a) &= f(2a) + [f(2a+1) - f(2a-1)], \\
g(N/2 - a) &= f(2a) - [f(2a+1) - f(2a-1)], \\
g(0) &= f(0), \\
g(N/4) &= f(N/2).
\end{aligned}$$

c. Take the $N/2$-point FFT of $g(a)$; call this result $G(b)$.

d. Form two sequences

$$\begin{aligned}
U(b) &= \mathrm{Re}[G(b)], & b &= 0, 1, 2, \ldots, N/4, \\
V(b) &= \frac{\mathrm{Im}[G(b)]}{2\sin(2\pi b/N)}, & b &= 1, 2, \ldots, N/4 - 1.
\end{aligned}$$

e. For $b = 1, 2, \ldots, N/4$, the transformed data sequence $F(b)$ is given as

$$\begin{aligned}
F(b) &= U(b) + V(b), \\
F(N/2 - b) &= U(b) - V(b), \\
F(0) &= U(0) + V(0), \\
F(N/2) &= U(0) - V(0).
\end{aligned}$$

Notice that in step d, the computation involves division by $\sin(2\pi b/N)$. This may cause
stability problems for large $N$, since the divisor becomes very small for small $b$.
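Steps a through e can be sketched as follows for real data satisfying $f(a) = f(-a)$, with $N$ divisible by 4. This is a minimal illustration under stated assumptions, not the routine of [10] or [16]: a plain $O(n^2)$ DFT with kernel $e^{-2\pi i ab/n}$ stands in for the $N/2$-point FFT, the function names are hypothetical, and $V(N/4)$ is set to zero, a value that follows from the symmetry of $f$ but lies outside the range given in step d.

```python
import cmath
import math


def dft(v):
    """Plain O(n^2) DFT with kernel exp(-2*pi*i*a*b/n); stands in for an FFT routine."""
    n = len(v)
    return [sum(v[a] * cmath.exp(-2j * cmath.pi * a * b / n) for a in range(n))
            for b in range(n)]


def symmetric_dft_half_size(f):
    """Return F(0), ..., F(N/2) for real f with f(a) = f(-a) and N divisible by 4."""
    N = len(f)
    M, Q = N // 2, N // 4
    v0 = 2 * sum(f[2 * a + 1] for a in range(Q))            # step a
    g = [0.0] * M                                           # step b
    g[0], g[Q] = f[0], f[M]
    for a in range(1, Q):
        d = f[2 * a + 1] - f[2 * a - 1]
        g[a] = f[2 * a] + d
        g[M - a] = f[2 * a] - d
    G = dft(g)                                              # step c (half-size transform)
    U = [G[b].real for b in range(Q + 1)]                   # step d
    V = [v0] + [G[b].imag / (2 * math.sin(2 * math.pi * b / N)) for b in range(1, Q)]
    V.append(0.0)   # assumption: V(N/4) = 0, implied by the symmetry of f
    F = [0.0] * (M + 1)                                     # step e
    F[0], F[M] = U[0] + V[0], U[0] - V[0]
    for b in range(1, Q + 1):
        F[b] = U[b] + V[b]
        F[M - b] = U[b] - V[b]
    return F


# Usage: build symmetric real data from its first N/2 + 1 samples and compare
# against the full-size transform.
N = 16
half = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0]        # f(0), ..., f(N/2)
f = [half[min(a, N - a)] for a in range(N)]
F_fast = symmetric_dft_half_size(f)
F_ref = dft(f)
max_err = max(abs(F_fast[b] - F_ref[b]) for b in range(N // 2 + 1))
```

The remaining outputs follow from $F(N - b) = F(b)$. With the opposite sign convention in the transform kernel, the sign of $V(b)$ in step d would flip.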

We summarize here an algorithm proposed in [15] to overcome the stability problem.

a. Form two sequences

$$\begin{aligned}
h(a) &= f(a) + f(N/2 - a), & a &= 0, 1, 2, \ldots, N/4, \\
g(a) &= [f(a) - f(N/2 - a)]\cos(2\pi a/N), & a &= 0, 1, 2, \ldots, N/4;
\end{aligned}$$

both $h(a)$ and $g(a)$ have invariance conditions.

b. Take the $N/2$-point (half size) symmetric FT of $h(a)$ and $g(a)$.

c. The transformed data sequence $F(b)$ is given as

$$\begin{aligned}
F(2b) &= H(b), & b &= 0, 1, 2, \ldots, N/4 - 1, \\
F(1) &= G(0), \\
F(2b+1) &= 2G(b) - F(2b-1), & b &= 1, 2, \ldots, N/4 - 1.
\end{aligned}$$

This algorithm can be recursively used for transform size of $N = 2^m$ or $N = 2^m l$, where
$m > 1$ and $l$ is an odd number.
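The recursion can be sketched as below for data with $f(a) = f(-a)$. This is an illustrative sketch rather than the implementation of [15]: the name `symmetric_dft` is hypothetical, a plain $O(n^2)$ DFT is used as the base case whenever $N$ is not divisible by 4, and the even outputs are filled up to $b = N/4$ so that $F(N/2)$ is also returned.

```python
import cmath
import math


def dft(v):
    """Plain O(n^2) DFT; used only as the base case of the recursion."""
    n = len(v)
    return [sum(v[a] * cmath.exp(-2j * cmath.pi * a * b / n) for a in range(n))
            for b in range(n)]


def symmetric_dft(f):
    """Return F(0), ..., F(N/2) for a sequence with f(a) = f(-a)."""
    N = len(f)
    if N % 4 != 0:                      # base case: no further halving in this sketch
        return dft(f)[: N // 2 + 1]
    M, Q = N // 2, N // 4
    h = [0] * M
    g = [0] * M
    for a in range(Q + 1):              # step a, extended by symmetry to length N/2
        s = f[a] + f[M - a]
        d = (f[a] - f[M - a]) * math.cos(2 * math.pi * a / N)
        h[a] = h[(M - a) % M] = s
        g[a] = g[(M - a) % M] = d
    H = symmetric_dft(h)                # step b: half-size symmetric transforms
    G = symmetric_dft(g)
    F = [0] * (M + 1)                   # step c
    for b in range(Q + 1):
        F[2 * b] = H[b]
    F[1] = G[0]
    for b in range(1, Q):
        F[2 * b + 1] = 2 * G[b] - F[2 * b - 1]
    return F


def max_error(N, half):
    """Compare the recursive result with a full-size DFT of the symmetric extension."""
    f = [half[min(a, N - a)] for a in range(N)]
    fast, ref = symmetric_dft(f), dft(f)
    return max(abs(fast[b] - ref[b]) for b in range(N // 2 + 1))


err_16 = max_error(16, [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0])
err_24 = max_error(24, [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0, 8.0, 4.0, 5.0, 9.0])
```

No division occurs anywhere in the recursion; the odd outputs are obtained from a running recurrence instead, which is the point of the modification.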

In step a, multiplications by $\cos(2\pi a/N)$ are required to formulate $g(a)$. If, however, $N$ is
twice an odd number, then an alternative procedure, based on the Good-Thomas prime factor
algorithm [12, 18], can be used to avoid these multiplications. In this case, the computational
procedures can be stated as

a. Take the $N/2$-point (half size) symmetric FFT of $f_1(a) = f(2a)$ and $f_2(a) = f(N/2 + 2a)$;
call them $F_1(b)$ and $F_2(b)$ respectively.

b. For $b = 0, 1, 2, \ldots, (N/2 - 1)/2$, the transformed data sequence $F(b)$ is given as

$$\begin{aligned}
F(2b) = F(N - 2b) &= F_1(2b) + F_2(2b), \\
F(N/2 + 2b) = F(N/2 - 2b) &= F_1(2b) - F_2(2b).
\end{aligned}$$
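The twice-odd case can be sketched in the same way. The code below is an illustration under assumptions: `symmetric_dft_twice_odd` is a hypothetical name, the input is assumed to satisfy $f(a) = f(-a)$ with $N = 2M$ and $M$ odd, and a plain $O(n^2)$ DFT stands in for the half-size symmetric FFT. Indices of $f$ are taken modulo $N$.

```python
import cmath


def dft(v):
    """Plain O(n^2) DFT standing in for the half-size symmetric FFT."""
    n = len(v)
    return [sum(v[a] * cmath.exp(-2j * cmath.pi * a * b / n) for a in range(n))
            for b in range(n)]


def symmetric_dft_twice_odd(f):
    """Return F(0), ..., F(N-1) for f(a) = f(-a), where N = 2M and M is odd."""
    N = len(f)
    M = N // 2
    if N % 2 != 0 or M % 2 != 1:
        raise ValueError("N must be twice an odd number")
    f1 = [f[(2 * a) % N] for a in range(M)]          # step a: even-indexed samples
    f2 = [f[(M + 2 * a) % N] for a in range(M)]      # step a: odd-indexed samples
    F1, F2 = dft(f1), dft(f2)
    F = [0] * N
    for b in range((M - 1) // 2 + 1):                # step b: additions only
        c = 2 * b
        F[c] = F[(N - c) % N] = F1[c] + F2[c]
        F[M + c] = F[M - c] = F1[c] - F2[c]
    return F


def max_error(N, half):
    """Compare with a full-size DFT of the symmetric extension of `half`."""
    f = [half[min(a, N - a)] for a in range(N)]
    fast, ref = symmetric_dft_twice_odd(f), dft(f)
    return max(abs(fast[b] - ref[b]) for b in range(N))


err_10 = max_error(10, [3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
err_14 = max_error(14, [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0])
```

The post-processing uses only additions and subtractions, with no twiddle-factor multiplications, in line with the remark above.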

If the data is real, the same algorithm can be used with half-size real FFTs. The saving in
FFT computation will be approximately 50 percent in comparison with complex data.